gh-144157: Optimize bytes.translate() by deferring change detection #144158
Conversation
Move the equality check out of the hot loop to allow better compiler optimization. Instead of checking each byte during translation, perform a single memcmp at the end to determine if the input can be returned unchanged. This allows compilers to unroll and pipeline the loops, resulting in ~2x throughput improvement for medium-to-large inputs (tested on an AMD zen2). No change observed on small inputs. It will also be faster for bytes subclasses as those do not need change detection.
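The pattern can be sketched roughly as follows. This is a simplified illustration of the deferred-check idea with assumed names and buffer handling, not the actual Objects/bytesobject.c code:

```c
#include <stddef.h>
#include <string.h>

/* Translate `len` bytes of `input` into `output` through a 256-entry
 * table, then report whether anything actually changed. */
static int
translate_then_compare(unsigned char *output, const unsigned char *input,
                       size_t len, const unsigned char table[256])
{
    /* Hot loop: a pure table lookup and store with no per-byte branch,
     * so the compiler is free to unroll and pipeline it. */
    for (size_t i = 0; i < len; i++) {
        output[i] = table[input[i]];
    }
    /* Deferred change detection: one memcmp that stops at the first
     * differing byte. Nonzero means the translation changed something. */
    return memcmp(output, input, len) != 0;
}
```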
vstinner left a comment
LGTM
vstinner left a comment
The loop already ran *output++ = table_chars[c]; before the change; the change only moves the changed = 1 logic outside the loop.
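For contrast, a minimal sketch of the pre-change loop shape (assumed names, not the literal CPython code), showing the per-byte branch that moves out:

```c
#include <stddef.h>

/* Simplified pre-change shape: the store happens every iteration either
 * way; only the per-byte `changed` branch is what the PR moves out of
 * the loop, replacing it with a single memcmp afterwards. */
static int
translate_with_inline_check(unsigned char *output, const unsigned char *input,
                            size_t len, const unsigned char table[256])
{
    int changed = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = input[i];
        output[i] = table[c];
        if (output[i] != c)
            changed = 1;    /* per-byte branch that inhibits vectorization */
    }
    return changed;
}
```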
May I ask at what lengths you tested it?
I tested 64 bytes - 256k as a microbenchmark using https://github.com/gpshead/cpython/blob/6d1b11ac1d84228f5ee7b5d4f3ab0c7fb77b7719/Tools/scripts/translate_bench.py#L454-L457 with --bytes_only. Claude wrote that and I didn't spend much time looking it over; I'd have written it a bit differently myself to reduce overhead further given it's a microbenchmark, but it works and demonstrates the change, and the lack of a regression on tiny data, regardless.

Skimming my data, the result was already a clear 10-15% improvement at 64 bytes and approached 2x as the size got larger on my zen2. I didn't spend time looking at the generated asm, but the result makes sense in this case: the "changed" test was being done in the loop for every byte despite being something that only needs to short-circuit evaluate. This way it is removed, so the hot paths of translation and (maybe) change detection are both parallelizable memory-streaming operations, and change detection short-circuits, exiting the memcmp at the first changed byte (thus an identity translation with no changes sees a slightly lower performance gain than others).

Roughly a 2x speedup for large inputs. For smaller inputs (64-127 bytes), the gains are more modest at 8-25% faster, where the fixed overhead of the call dominates. I neglected to measure smaller than that, but I do not expect any meaningfully measurable regression.

Expand for a detailed table (x86_64 zen2, gcc 15.2).

Other platforms? Rerunning the benchmark on 32-bit Raspbian (arm32) on an rpi5, there are still gains. I included smaller 8, 20, and 32 byte sizes in this run, but the overall result is less impressive: 2%-30% at most for 64 bytes on up, and slightly slower on the tiny sizes, though close enough that it could be in the noise. This lower-spec arm probably doesn't pipeline as well or coalesce writes.

Rerunning it on 64-bit Raspbian (arm64) on an rpi4 (wow, those feel slow these days...), the gains are much better than the arm32 above, closer to what x86_64 zen2 saw: 10%-170% from 64 bytes through 256k, insignificant for 32 bytes and below.
@gpshead Whoa! Many thanks for such a detailed answer!