gh-144157: Optimize bytes.translate() by deferring change detection #144158
Conversation
Move the equality check out of the hot loop to allow better compiler optimization. Instead of checking each byte during translation, perform a single memcmp at the end to determine if the input can be returned unchanged. This allows compilers to unroll and pipeline the loops, resulting in ~2x throughput improvement for medium-to-large inputs (tested on an AMD zen2). No change observed on small inputs. It will also be faster for bytes subclasses as those do not need change detection.
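The pattern can be sketched roughly as follows. This is a simplified illustration of the deferred-check idea with assumed names and buffer handling, not the actual Objects/bytesobject.c code:

```c
#include <stddef.h>
#include <string.h>

/* Translate `len` bytes of `input` into `output` through a 256-entry
 * table, then report whether anything actually changed. */
static int
translate_then_compare(unsigned char *output, const unsigned char *input,
                       size_t len, const unsigned char table[256])
{
    /* Hot loop: a pure table lookup and store with no per-byte branch,
     * so the compiler is free to unroll and pipeline it. */
    for (size_t i = 0; i < len; i++) {
        output[i] = table[input[i]];
    }
    /* Deferred change detection: one memcmp that stops at the first
     * differing byte. Nonzero means the translation changed something. */
    return memcmp(output, input, len) != 0;
}
```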
vstinner left a comment
LGTM
vstinner left a comment
The loop already ran *output++ = table_chars[c]; before the change; the change only moves the changed = 1 logic outside the loop.
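For contrast, a minimal sketch of the pre-change loop shape (assumed names, not the literal CPython code), showing the per-byte branch that moves out:

```c
#include <stddef.h>

/* Simplified pre-change shape: the store happens every iteration either
 * way; only the per-byte `changed` branch is what the PR moves out of
 * the loop, replacing it with a single memcmp afterwards. */
static int
translate_with_inline_check(unsigned char *output, const unsigned char *input,
                            size_t len, const unsigned char table[256])
{
    int changed = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = input[i];
        output[i] = table[c];
        if (output[i] != c)
            changed = 1;    /* per-byte branch that inhibits vectorization */
    }
    return changed;
}
```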
May I ask at what lengths you tested it?
I tested 64 bytes - 256k as a microbenchmark using https://github.com/gpshead/cpython/blob/6d1b11ac1d84228f5ee7b5d4f3ab0c7fb77b7719/Tools/scripts/translate_bench.py#L454-L457 with --bytes_only. Claude wrote that and I didn't spend much time looking it over; I'd have written it a bit differently myself to reduce overhead further given it's a microbenchmark, but it works and demonstrates the change, and the lack of a regression on tiny data, regardless.

Skimming my data, the result was already a clear 10-15% improvement at 64 bytes and approached 2x as the size got larger on my zen2. I didn't spend time looking at the generated asm, but the result makes sense in this case: the "changed" test was being done in the loop for every byte despite being something that only needs to short-circuit evaluate. This way it is removed, so the hot paths of translation and (maybe) change detection are both parallelizable memory-streaming operations, and change detection short-circuits, exiting the memcmp at the first changed byte (thus an identity translation with no changes sees a slightly lower performance gain than others).

Roughly a 2x speedup for large inputs. For smaller inputs (64-127 bytes), the gains are more modest at 8-25% faster, where the fixed overhead of the call dominates. I neglected to measure smaller than that, but I do not expect any meaningfully measurable regression.

Expand for a detailed table (x86_64 zen2, gcc 15.2).

Other platforms? Rerunning the benchmark on 32-bit Raspbian (arm32) on an rpi5, there are still gains. I included smaller 8, 20, and 32 byte sizes in this run, but the overall result is less impressive: 2%-30% at most for 64 bytes on up, and slightly slower on the tiny sizes, though close enough that it could be in the noise. This lower-spec arm probably doesn't pipeline as well or coalesce writes.

Rerunning it on 64-bit Raspbian (arm64) on an rpi4 (wow, those feel slow these days...), the gains are much better than the arm32 above, closer to what x86_64 zen2 saw: 10%-170% from 64 bytes through 256k, insignificant for 32 bytes and below.
@gpshead Whoa! Many thanks for such a detailed answer!