Skip to content

Conversation

@gabotechs
Copy link
Contributor

@gabotechs gabotechs commented Jan 12, 2026

Which issue does this PR close?

It does not close any issue, but it's related to:

Rationale for this change

This is a PR from a batch of PRs that attempt to improve performance in hash joins:

It adds the new BufferExec node at the top of the probe side of hash joins so that some work is eagerly performed before the build side of the hash join is completely finished.

Why should this speed up joins?

In order to better understand the impact of this PR, it's useful to understand how streams work in Rust: creating a stream does not perform any work, progress is just made if the stream gets polled.

This means that whenever we call .execute() on an ExecutionPlan (like the probe side of a join), nothing happens, not even the most basic TCP connections or system calls are performed. Instead, all this work is delayed as much as possible until the first poll is made to the stream, losing the opportunity to make some early progress.

This gets worst when multiple hash joins are chained together: they will get executed in cascade as if they were domino pieces, which has the benefit of leaving a small memory footprint, but underutilizes the resources of the machine for executing the query faster.

Note

Even if this shows overall performance improvement in the benchmarks, it can show performance degradation on queries with dynamic filters, so hash join buffering is disabled by default, and users can opt in.
Follow up work will be needed in order to make this interact well with dynamic filters.

What changes are included in this PR?

Adds a new HashJoinBuffering physical optimizer rule that will idempotently place BufferExec nodes on the probe side of has joins:

            ┌───────────────────┐
            │   HashJoinExec    │
            └─────▲────────▲────┘
          ┌───────┘        └─────────┐
          │                          │
 ┌────────────────┐         ┌─────────────────┐
 │   Build side   │       + │   BufferExec    │
 └────────────────┘         └────────▲────────┘
                                     │
                            ┌────────┴────────┐
                            │   Probe side    │
                            └─────────────────┘

Are these changes tested?

yes, by existing tests

Are there any user-facing changes?

Not by default, users can now opt in to this feature with the hash_join_buffering_capacity config parameter.


Results

Note

Note that a small number of TPC-DS queries have regressed, this is because with eager buffering they do not benefit from dynamic filters as much. This is the main reason for leaving this config parameter disabled by default until we have a proper way for interacting with the dynamic filters inside the BufferExec node.

./bench.sh compare main hash-join-buffering-on-probe-side
Comparing main and hash-join-buffering-on-probe-side
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃       main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │   39.70 ms │                          18.30 ms │ +2.17x faster │
│ QQuery 2  │  134.95 ms │                          57.64 ms │ +2.34x faster │
│ QQuery 3  │  103.08 ms │                          89.88 ms │ +1.15x faster │
│ QQuery 4  │ 1034.04 ms │                         340.84 ms │ +3.03x faster │
│ QQuery 5  │  155.97 ms │                         139.44 ms │ +1.12x faster │
│ QQuery 6  │  591.97 ms │                         523.17 ms │ +1.13x faster │
│ QQuery 7  │  304.47 ms │                         235.52 ms │ +1.29x faster │
│ QQuery 8  │  100.02 ms │                          90.65 ms │ +1.10x faster │
│ QQuery 9  │   91.86 ms │                          92.25 ms │     no change │
│ QQuery 10 │   90.97 ms │                          47.59 ms │ +1.91x faster │
│ QQuery 11 │  649.77 ms │                         217.42 ms │ +2.99x faster │
│ QQuery 12 │   38.12 ms │                          31.71 ms │ +1.20x faster │
│ QQuery 13 │  337.93 ms │                         302.24 ms │ +1.12x faster │
│ QQuery 14 │  797.93 ms │                         439.40 ms │ +1.82x faster │
│ QQuery 15 │   28.14 ms │                          47.75 ms │  1.70x slower │
│ QQuery 16 │   34.08 ms │                         103.14 ms │  3.03x slower │
│ QQuery 17 │  225.43 ms │                         152.29 ms │ +1.48x faster │
│ QQuery 18 │  118.09 ms │                         164.11 ms │  1.39x slower │
│ QQuery 19 │  143.39 ms │                         120.07 ms │ +1.19x faster │
│ QQuery 20 │   12.70 ms │                          52.15 ms │  4.11x slower │
│ QQuery 21 │   16.74 ms │                         184.55 ms │ 11.02x slower │
│ QQuery 22 │  311.97 ms │                         358.70 ms │  1.15x slower │
│ QQuery 23 │  807.41 ms │                         531.22 ms │ +1.52x faster │
│ QQuery 24 │  347.90 ms │                         279.01 ms │ +1.25x faster │
│ QQuery 25 │  313.20 ms │                         183.26 ms │ +1.71x faster │
│ QQuery 26 │   83.57 ms │                         124.28 ms │  1.49x slower │
│ QQuery 27 │  300.93 ms │                         237.28 ms │ +1.27x faster │
│ QQuery 28 │  130.79 ms │                         129.64 ms │     no change │
│ QQuery 29 │  267.08 ms │                         157.55 ms │ +1.70x faster │
│ QQuery 30 │   37.23 ms │                          25.98 ms │ +1.43x faster │
│ QQuery 31 │  128.57 ms │                         102.96 ms │ +1.25x faster │
│ QQuery 32 │   50.16 ms │                          42.77 ms │ +1.17x faster │
│ QQuery 33 │  114.06 ms │                         110.83 ms │     no change │
│ QQuery 34 │   89.27 ms │                          77.19 ms │ +1.16x faster │
│ QQuery 35 │   86.66 ms │                          50.86 ms │ +1.70x faster │
│ QQuery 36 │  173.00 ms │                         160.46 ms │ +1.08x faster │
│ QQuery 37 │  157.69 ms │                         153.57 ms │     no change │
│ QQuery 38 │   62.53 ms │                          52.28 ms │ +1.20x faster │
│ QQuery 39 │   83.38 ms │                         394.28 ms │  4.73x slower │
│ QQuery 40 │   87.64 ms │                          77.15 ms │ +1.14x faster │
│ QQuery 41 │   16.23 ms │                          15.05 ms │ +1.08x faster │
│ QQuery 42 │   93.24 ms │                          88.03 ms │ +1.06x faster │
│ QQuery 43 │   72.64 ms │                          63.49 ms │ +1.14x faster │
│ QQuery 44 │    9.06 ms │                           7.80 ms │ +1.16x faster │
│ QQuery 45 │   55.46 ms │                          34.12 ms │ +1.63x faster │
│ QQuery 46 │  185.75 ms │                         163.09 ms │ +1.14x faster │
│ QQuery 47 │  529.01 ms │                         143.05 ms │ +3.70x faster │
│ QQuery 48 │  236.59 ms │                         198.08 ms │ +1.19x faster │
│ QQuery 49 │  208.83 ms │                         191.07 ms │ +1.09x faster │
│ QQuery 50 │  176.04 ms │                         143.57 ms │ +1.23x faster │
│ QQuery 51 │  140.97 ms │                          96.36 ms │ +1.46x faster │
│ QQuery 52 │   92.83 ms │                          86.68 ms │ +1.07x faster │
│ QQuery 53 │   90.46 ms │                          83.34 ms │ +1.09x faster │
│ QQuery 54 │  135.74 ms │                         116.89 ms │ +1.16x faster │
│ QQuery 55 │   91.55 ms │                          87.18 ms │     no change │
│ QQuery 56 │  113.12 ms │                         111.00 ms │     no change │
│ QQuery 57 │  129.43 ms │                          78.69 ms │ +1.64x faster │
│ QQuery 58 │  229.68 ms │                         165.27 ms │ +1.39x faster │
│ QQuery 59 │  161.24 ms │                         125.57 ms │ +1.28x faster │
│ QQuery 60 │  116.86 ms │                         111.38 ms │     no change │
│ QQuery 61 │  150.19 ms │                         143.00 ms │     no change │
│ QQuery 62 │  426.70 ms │                         413.02 ms │     no change │
│ QQuery 63 │   93.41 ms │                          81.94 ms │ +1.14x faster │
│ QQuery 64 │  578.51 ms │                         442.41 ms │ +1.31x faster │
│ QQuery 65 │  201.75 ms │                          87.46 ms │ +2.31x faster │
│ QQuery 66 │  181.57 ms │                         184.28 ms │     no change │
│ QQuery 67 │  246.39 ms │                         226.38 ms │ +1.09x faster │
│ QQuery 68 │  230.40 ms │                         212.41 ms │ +1.08x faster │
│ QQuery 69 │   91.30 ms │                          46.05 ms │ +1.98x faster │
│ QQuery 70 │  270.46 ms │                         232.65 ms │ +1.16x faster │
│ QQuery 71 │  111.93 ms │                         107.35 ms │     no change │
│ QQuery 72 │  562.16 ms │                         435.56 ms │ +1.29x faster │
│ QQuery 73 │   85.66 ms │                          81.05 ms │ +1.06x faster │
│ QQuery 74 │  371.14 ms │                         148.67 ms │ +2.50x faster │
│ QQuery 75 │  221.61 ms │                         170.13 ms │ +1.30x faster │
│ QQuery 76 │  122.88 ms │                         107.23 ms │ +1.15x faster │
│ QQuery 77 │  163.52 ms │                         140.98 ms │ +1.16x faster │
│ QQuery 78 │  313.92 ms │                         205.72 ms │ +1.53x faster │
│ QQuery 79 │  187.31 ms │                         163.73 ms │ +1.14x faster │
│ QQuery 80 │  262.86 ms │                         240.16 ms │ +1.09x faster │
│ QQuery 81 │   21.89 ms │                          18.13 ms │ +1.21x faster │
│ QQuery 82 │  172.40 ms │                         159.50 ms │ +1.08x faster │
│ QQuery 83 │   45.38 ms │                          22.20 ms │ +2.04x faster │
│ QQuery 84 │   39.41 ms │                          30.58 ms │ +1.29x faster │
│ QQuery 85 │  141.02 ms │                          73.78 ms │ +1.91x faster │
│ QQuery 86 │   31.68 ms │                          28.44 ms │ +1.11x faster │
│ QQuery 87 │   63.57 ms │                          53.78 ms │ +1.18x faster │
│ QQuery 88 │   88.01 ms │                          74.09 ms │ +1.19x faster │
│ QQuery 89 │  105.44 ms │                          85.16 ms │ +1.24x faster │
│ QQuery 90 │   19.76 ms │                          16.27 ms │ +1.21x faster │
│ QQuery 91 │   52.31 ms │                          33.46 ms │ +1.56x faster │
│ QQuery 92 │   50.04 ms │                          25.38 ms │ +1.97x faster │
│ QQuery 93 │  143.48 ms │                         130.62 ms │ +1.10x faster │
│ QQuery 94 │   50.84 ms │                          45.57 ms │ +1.12x faster │
│ QQuery 95 │  131.03 ms │                          57.60 ms │ +2.27x faster │
│ QQuery 96 │   60.62 ms │                          53.24 ms │ +1.14x faster │
│ QQuery 97 │   95.60 ms │                          67.33 ms │ +1.42x faster │
│ QQuery 98 │  125.88 ms │                         103.03 ms │ +1.22x faster │
│ QQuery 99 │ 4475.55 ms │                        4459.77 ms │     no change │
└───────────┴────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                                │ 22354.66ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 18217.21ms │
│ Average Time (main)                              │   225.80ms │
│ Average Time (hash-join-buffering-on-probe-side) │   184.01ms │
│ Queries Faster                                   │         79 │
│ Queries Slower                                   │          8 │
│ Queries with No Change                           │         12 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │  42.94 ms │                          45.96 ms │  1.07x slower │
│ QQuery 2  │  20.64 ms │                          13.00 ms │ +1.59x faster │
│ QQuery 3  │  30.07 ms │                          24.52 ms │ +1.23x faster │
│ QQuery 4  │  17.22 ms │                          16.39 ms │     no change │
│ QQuery 5  │  98.91 ms │                          41.25 ms │ +2.40x faster │
│ QQuery 6  │  18.67 ms │                          18.23 ms │     no change │
│ QQuery 7  │ 104.82 ms │                          46.45 ms │ +2.26x faster │
│ QQuery 8  │  97.98 ms │                          34.09 ms │ +2.87x faster │
│ QQuery 9  │  86.25 ms │                          43.32 ms │ +1.99x faster │
│ QQuery 10 │ 106.09 ms │                          41.49 ms │ +2.56x faster │
│ QQuery 11 │  13.77 ms │                          11.15 ms │ +1.24x faster │
│ QQuery 12 │  54.57 ms │                          30.04 ms │ +1.82x faster │
│ QQuery 13 │  21.71 ms │                          21.74 ms │     no change │
│ QQuery 14 │  51.38 ms │                          21.87 ms │ +2.35x faster │
│ QQuery 15 │  35.13 ms │                          27.95 ms │ +1.26x faster │
│ QQuery 16 │  13.04 ms │                          12.05 ms │ +1.08x faster │
│ QQuery 17 │  82.94 ms │                          53.05 ms │ +1.56x faster │
│ QQuery 18 │ 109.92 ms │                          61.16 ms │ +1.80x faster │
│ QQuery 19 │  37.57 ms │                          37.79 ms │     no change │
│ QQuery 20 │  60.77 ms │                          26.21 ms │ +2.32x faster │
│ QQuery 21 │  78.28 ms │                          54.11 ms │ +1.45x faster │
│ QQuery 22 │   8.48 ms │                           8.73 ms │     no change │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                                │ 1191.17ms │
│ Total Time (hash-join-buffering-on-probe-side)   │  690.57ms │
│ Average Time (main)                              │   54.14ms │
│ Average Time (hash-join-buffering-on-probe-side) │   31.39ms │
│ Queries Faster                                   │        16 │
│ Queries Slower                                   │         1 │
│ Queries with No Change                           │         5 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 373.55 ms │                         331.39 ms │ +1.13x faster │
│ QQuery 2  │ 150.76 ms │                          89.50 ms │ +1.68x faster │
│ QQuery 3  │ 320.66 ms │                         271.20 ms │ +1.18x faster │
│ QQuery 4  │ 132.21 ms │                         115.69 ms │ +1.14x faster │
│ QQuery 5  │ 462.29 ms │                         401.73 ms │ +1.15x faster │
│ QQuery 6  │ 150.85 ms │                         124.88 ms │ +1.21x faster │
│ QQuery 7  │ 631.13 ms │                         547.52 ms │ +1.15x faster │
│ QQuery 8  │ 593.22 ms │                         445.85 ms │ +1.33x faster │
│ QQuery 9  │ 780.22 ms │                         657.02 ms │ +1.19x faster │
│ QQuery 10 │ 495.89 ms │                         324.21 ms │ +1.53x faster │
│ QQuery 11 │ 144.90 ms │                          88.97 ms │ +1.63x faster │
│ QQuery 12 │ 263.50 ms │                         188.59 ms │ +1.40x faster │
│ QQuery 13 │ 287.01 ms │                         217.33 ms │ +1.32x faster │
│ QQuery 14 │ 248.88 ms │                         166.86 ms │ +1.49x faster │
│ QQuery 15 │ 399.52 ms │                         280.62 ms │ +1.42x faster │
│ QQuery 16 │  97.62 ms │                          65.14 ms │ +1.50x faster │
│ QQuery 17 │ 780.01 ms │                         641.17 ms │ +1.22x faster │
│ QQuery 18 │ 824.42 ms │                         696.09 ms │ +1.18x faster │
│ QQuery 19 │ 367.17 ms │                         268.54 ms │ +1.37x faster │
│ QQuery 20 │ 332.86 ms │                         241.19 ms │ +1.38x faster │
│ QQuery 21 │ 856.49 ms │                         697.65 ms │ +1.23x faster │
│ QQuery 22 │  89.72 ms │                          72.73 ms │ +1.23x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                                │ 8782.89ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 6933.87ms │
│ Average Time (main)                              │  399.22ms │
│ Average Time (hash-join-buffering-on-probe-side) │  315.18ms │
│ Queries Faster                                   │        22 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │         0 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate proto Related to proto crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Jan 12, 2026
@gabotechs
Copy link
Contributor Author

run benchmarks

@alamb-ghbot
Copy link

🤖 Hi @gabotechs, thanks for the request (#19761 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Omega359, adriangb, alamb, comphead, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 12, 2026
@gabotechs
Copy link
Contributor Author

run benchmarks

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │  2479.50 ms │                        2365.91 ms │     no change │
│ QQuery 1 │   933.04 ms │                         961.61 ms │     no change │
│ QQuery 2 │  2128.72 ms │                        1828.41 ms │ +1.16x faster │
│ QQuery 3 │  1140.67 ms │                        1106.77 ms │     no change │
│ QQuery 4 │  2349.73 ms │                        2265.79 ms │     no change │
│ QQuery 5 │ 28477.94 ms │                       27819.90 ms │     no change │
│ QQuery 6 │  3913.85 ms │                        3886.72 ms │     no change │
│ QQuery 7 │  2907.17 ms │                        2857.38 ms │     no change │
└──────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 44330.62ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 43092.50ms │
│ Average Time (HEAD)                              │  5541.33ms │
│ Average Time (hash-join-buffering-on-probe-side) │  5386.56ms │
│ Queries Faster                                   │          1 │
│ Queries Slower                                   │          0 │
│ Queries with No Change                           │          7 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.91 ms │                           1.94 ms │     no change │
│ QQuery 1  │    50.86 ms │                          51.03 ms │     no change │
│ QQuery 2  │   129.07 ms │                         131.06 ms │     no change │
│ QQuery 3  │   151.75 ms │                         154.89 ms │     no change │
│ QQuery 4  │  1070.04 ms │                        1218.71 ms │  1.14x slower │
│ QQuery 5  │  1377.65 ms │                        1501.78 ms │  1.09x slower │
│ QQuery 6  │     1.82 ms │                           1.87 ms │     no change │
│ QQuery 7  │    56.03 ms │                          61.22 ms │  1.09x slower │
│ QQuery 8  │  1423.84 ms │                        1561.18 ms │  1.10x slower │
│ QQuery 9  │  1748.54 ms │                        1871.82 ms │  1.07x slower │
│ QQuery 10 │   343.11 ms │                         350.58 ms │     no change │
│ QQuery 11 │   390.93 ms │                         400.26 ms │     no change │
│ QQuery 12 │  1249.28 ms │                        1460.10 ms │  1.17x slower │
│ QQuery 13 │  1916.12 ms │                        2067.22 ms │  1.08x slower │
│ QQuery 14 │  1214.64 ms │                        1359.01 ms │  1.12x slower │
│ QQuery 15 │  1224.35 ms │                        1382.17 ms │  1.13x slower │
│ QQuery 16 │  2587.35 ms │                        2651.10 ms │     no change │
│ QQuery 17 │  2481.42 ms │                        2645.83 ms │  1.07x slower │
│ QQuery 18 │  6019.63 ms │                        4969.84 ms │ +1.21x faster │
│ QQuery 19 │   118.04 ms │                         122.91 ms │     no change │
│ QQuery 20 │  1977.36 ms │                        1907.42 ms │     no change │
│ QQuery 21 │  2282.79 ms │                        2227.74 ms │     no change │
│ QQuery 22 │  4147.94 ms │                        3809.68 ms │ +1.09x faster │
│ QQuery 23 │ 18037.69 ms │                       12405.70 ms │ +1.45x faster │
│ QQuery 24 │   203.52 ms │                         236.74 ms │  1.16x slower │
│ QQuery 25 │   482.62 ms │                         517.70 ms │  1.07x slower │
│ QQuery 26 │   218.15 ms │                         233.60 ms │  1.07x slower │
│ QQuery 27 │  2805.96 ms │                        2772.92 ms │     no change │
│ QQuery 28 │ 22174.76 ms │                       21847.75 ms │     no change │
│ QQuery 29 │   977.94 ms │                         952.08 ms │     no change │
│ QQuery 30 │  1315.68 ms │                        1336.40 ms │     no change │
│ QQuery 31 │  1366.16 ms │                        1421.09 ms │     no change │
│ QQuery 32 │  5155.78 ms │                        4350.28 ms │ +1.19x faster │
│ QQuery 33 │  5715.37 ms │                        5687.38 ms │     no change │
│ QQuery 34 │  6016.46 ms │                        5853.35 ms │     no change │
│ QQuery 35 │  1918.61 ms │                        2098.03 ms │  1.09x slower │
│ QQuery 36 │    67.22 ms │                          70.35 ms │     no change │
│ QQuery 37 │    45.47 ms │                          49.54 ms │  1.09x slower │
│ QQuery 38 │    65.42 ms │                          68.24 ms │     no change │
│ QQuery 39 │   104.44 ms │                         111.22 ms │  1.06x slower │
│ QQuery 40 │    27.46 ms │                          27.62 ms │     no change │
│ QQuery 41 │    23.04 ms │                          24.38 ms │  1.06x slower │
│ QQuery 42 │    19.89 ms │                          21.71 ms │  1.09x slower │
└───────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 98706.12ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 91995.42ms │
│ Average Time (HEAD)                              │  2295.49ms │
│ Average Time (hash-join-buffering-on-probe-side) │  2139.43ms │
│ Queries Faster                                   │          4 │
│ Queries Slower                                   │         18 │
│ Queries with No Change                           │         21 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 140.65 ms │                         101.97 ms │ +1.38x faster │
│ QQuery 2  │  37.21 ms │                          30.95 ms │ +1.20x faster │
│ QQuery 3  │  44.92 ms │                          32.31 ms │ +1.39x faster │
│ QQuery 4  │  31.87 ms │                          30.19 ms │ +1.06x faster │
│ QQuery 5  │  92.53 ms │                          94.55 ms │     no change │
│ QQuery 6  │  21.01 ms │                          20.99 ms │     no change │
│ QQuery 7  │ 157.97 ms │                         165.53 ms │     no change │
│ QQuery 8  │  41.01 ms │                          35.06 ms │ +1.17x faster │
│ QQuery 9  │ 102.50 ms │                          93.90 ms │ +1.09x faster │
│ QQuery 10 │  68.82 ms │                          67.90 ms │     no change │
│ QQuery 11 │  19.57 ms │                          17.92 ms │ +1.09x faster │
│ QQuery 12 │  52.47 ms │                          54.41 ms │     no change │
│ QQuery 13 │  50.52 ms │                          47.74 ms │ +1.06x faster │
│ QQuery 14 │  15.26 ms │                          15.25 ms │     no change │
│ QQuery 15 │  31.19 ms │                          30.51 ms │     no change │
│ QQuery 16 │  30.26 ms │                          28.22 ms │ +1.07x faster │
│ QQuery 17 │ 144.19 ms │                         150.21 ms │     no change │
│ QQuery 18 │ 286.83 ms │                         262.07 ms │ +1.09x faster │
│ QQuery 19 │  40.60 ms │                          41.31 ms │     no change │
│ QQuery 20 │  57.30 ms │                          56.06 ms │     no change │
│ QQuery 21 │ 188.92 ms │                         179.62 ms │     no change │
│ QQuery 22 │  22.42 ms │                          22.15 ms │     no change │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 1678.03ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 1578.79ms │
│ Average Time (HEAD)                              │   76.27ms │
│ Average Time (hash-join-buffering-on-probe-side) │   71.76ms │
│ Queries Faster                                   │        10 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │        12 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

@gabotechs
Copy link
Contributor Author

run benchmark tpcds tpch10

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpcds
Results will be posted here when complete

@alamb-ghbot
Copy link

Benchmark script failed with exit code 1.

Last 10 lines of output:

Click to expand
BRANCH_NAME: HEAD
DATA_DIR: /home/alamb/arrow-datafusion/benchmarks/data
RESULTS_DIR: /home/alamb/arrow-datafusion/benchmarks/results/HEAD
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds

@gabotechs
Copy link
Contributor Author

run benchmark tpch10

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpch10
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query     ┃ HEAD ┃ hash-join-buffering-on-probe-side ┃       Change ┃
┡━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1  │ FAIL │                              FAIL │ incomparable │
│ QQuery 2  │ FAIL │                              FAIL │ incomparable │
│ QQuery 3  │ FAIL │                              FAIL │ incomparable │
│ QQuery 4  │ FAIL │                              FAIL │ incomparable │
│ QQuery 5  │ FAIL │                              FAIL │ incomparable │
│ QQuery 6  │ FAIL │                              FAIL │ incomparable │
│ QQuery 7  │ FAIL │                              FAIL │ incomparable │
│ QQuery 8  │ FAIL │                              FAIL │ incomparable │
│ QQuery 9  │ FAIL │                              FAIL │ incomparable │
│ QQuery 10 │ FAIL │                              FAIL │ incomparable │
│ QQuery 11 │ FAIL │                              FAIL │ incomparable │
│ QQuery 12 │ FAIL │                              FAIL │ incomparable │
│ QQuery 13 │ FAIL │                              FAIL │ incomparable │
│ QQuery 14 │ FAIL │                              FAIL │ incomparable │
│ QQuery 15 │ FAIL │                              FAIL │ incomparable │
│ QQuery 16 │ FAIL │                              FAIL │ incomparable │
│ QQuery 17 │ FAIL │                              FAIL │ incomparable │
│ QQuery 18 │ FAIL │                              FAIL │ incomparable │
│ QQuery 19 │ FAIL │                              FAIL │ incomparable │
│ QQuery 20 │ FAIL │                              FAIL │ incomparable │
│ QQuery 21 │ FAIL │                              FAIL │ incomparable │
│ QQuery 22 │ FAIL │                              FAIL │ incomparable │
└───────────┴──────┴───────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Benchmark Summary                                ┃        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total Time (HEAD)                                │ 0.00ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 0.00ms │
│ Average Time (HEAD)                              │ 0.00ms │
│ Average Time (hash-join-buffering-on-probe-side) │ 0.00ms │
│ Queries Faster                                   │      0 │
│ Queries Slower                                   │      0 │
│ Queries with No Change                           │      0 │
│ Queries with Failure                             │     22 │
└──────────────────────────────────────────────────┴────────┘

@gabotechs
Copy link
Contributor Author

run benchmark tpch

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpch
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 186.54 ms │                         180.81 ms │     no change │
│ QQuery 2  │  92.79 ms │                          48.71 ms │ +1.90x faster │
│ QQuery 3  │ 129.28 ms │                         106.07 ms │ +1.22x faster │
│ QQuery 4  │  80.78 ms │                          74.64 ms │ +1.08x faster │
│ QQuery 5  │ 186.74 ms │                         163.71 ms │ +1.14x faster │
│ QQuery 6  │  70.54 ms │                          66.87 ms │ +1.06x faster │
│ QQuery 7  │ 222.50 ms │                         194.54 ms │ +1.14x faster │
│ QQuery 8  │ 175.16 ms │                         125.23 ms │ +1.40x faster │
│ QQuery 9  │ 231.17 ms │                         174.24 ms │ +1.33x faster │
│ QQuery 10 │ 190.18 ms │                         148.84 ms │ +1.28x faster │
│ QQuery 11 │  70.01 ms │                          46.31 ms │ +1.51x faster │
│ QQuery 12 │ 120.18 ms │                         109.09 ms │ +1.10x faster │
│ QQuery 13 │ 219.34 ms │                         204.01 ms │ +1.08x faster │
│ QQuery 14 │  95.98 ms │                          88.23 ms │ +1.09x faster │
│ QQuery 15 │ 132.46 ms │                         100.40 ms │ +1.32x faster │
│ QQuery 16 │  64.09 ms │                          46.41 ms │ +1.38x faster │
│ QQuery 17 │ 280.98 ms │                         211.97 ms │ +1.33x faster │
│ QQuery 18 │ 332.62 ms │                         271.65 ms │ +1.22x faster │
│ QQuery 19 │ 140.44 ms │                         130.87 ms │ +1.07x faster │
│ QQuery 20 │ 135.30 ms │                         100.57 ms │ +1.35x faster │
│ QQuery 21 │ 265.90 ms │                         234.12 ms │ +1.14x faster │
│ QQuery 22 │  41.36 ms │                          37.33 ms │ +1.11x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 3464.36ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 2864.63ms │
│ Average Time (HEAD)                              │  157.47ms │
│ Average Time (hash-join-buffering-on-probe-side) │  130.21ms │
│ Queries Faster                                   │        21 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │         1 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

@gabotechs
Copy link
Contributor Author

run benchmark tpcds

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpcds
Results will be posted here when complete

@alamb-ghbot
Copy link

Benchmark script failed with exit code 1.

Last 10 lines of output:

Click to expand
BRANCH_NAME: HEAD
DATA_DIR: /home/alamb/arrow-datafusion/benchmarks/data
RESULTS_DIR: /home/alamb/arrow-datafusion/benchmarks/results/HEAD
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds

@gabotechs
Copy link
Contributor Author

🤔 the tpcds benchmark command seems broken

@adriangb
Copy link
Contributor

Big picture question: could this be part of the HashJoinExec code?

@gabotechs
Copy link
Contributor Author

Big picture question: could this be part of the HashJoinExec code?

Potentially yes, although I do see value in having this be its own node, that way we can:

  • collect metrics associated to it independently to the hash join node
  • use it in more places

This is actually something we (DataDog) have had for a long time, we have a similar BufferExec node that we decide to place not only on probe side of hash joins, but also in other places for other parts of our plans.

@gabotechs gabotechs force-pushed the hash-join-buffering-on-probe-side branch 2 times, most recently from 6fa7c76 to 139cf50 Compare January 16, 2026 13:24
@gabotechs
Copy link
Contributor Author

run benchmark tpcds tpch

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (139cf50) to ca904b3 diff using: tpcds
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │    73.20 ms │                          72.20 ms │     no change │
│ QQuery 2  │   209.88 ms │                         166.95 ms │ +1.26x faster │
│ QQuery 3  │   158.47 ms │                         155.49 ms │     no change │
│ QQuery 4  │  1866.97 ms │                        1397.66 ms │ +1.34x faster │
│ QQuery 5  │   281.39 ms │                         296.97 ms │  1.06x slower │
│ QQuery 6  │  1378.68 ms │                        1457.54 ms │  1.06x slower │
│ QQuery 7  │   495.94 ms │                         505.73 ms │     no change │
│ QQuery 8  │   174.50 ms │                         176.50 ms │     no change │
│ QQuery 9  │   302.23 ms │                         295.84 ms │     no change │
│ QQuery 10 │   179.18 ms │                         171.69 ms │     no change │
│ QQuery 11 │  1327.09 ms │                         857.65 ms │ +1.55x faster │
│ QQuery 12 │    70.95 ms │                          60.27 ms │ +1.18x faster │
│ QQuery 13 │   537.04 ms │                         505.94 ms │ +1.06x faster │
│ QQuery 14 │  1904.23 ms │                        1582.52 ms │ +1.20x faster │
│ QQuery 15 │    30.86 ms │                          32.44 ms │  1.05x slower │
│ QQuery 16 │    67.06 ms │                          58.85 ms │ +1.14x faster │
│ QQuery 17 │   362.76 ms │                         361.85 ms │     no change │
│ QQuery 18 │   192.36 ms │                         189.70 ms │     no change │
│ QQuery 19 │   231.27 ms │                         232.61 ms │     no change │
│ QQuery 20 │    27.68 ms │                          26.37 ms │     no change │
│ QQuery 21 │    40.11 ms │                          34.05 ms │ +1.18x faster │
│ QQuery 22 │   747.50 ms │                         703.86 ms │ +1.06x faster │
│ QQuery 23 │  1750.16 ms │                        1735.08 ms │     no change │
│ QQuery 24 │   655.13 ms │                         642.33 ms │     no change │
│ QQuery 25 │   523.30 ms │                         512.55 ms │     no change │
│ QQuery 26 │   128.50 ms │                         129.16 ms │     no change │
│ QQuery 27 │   491.21 ms │                         506.80 ms │     no change │
│ QQuery 28 │   287.94 ms │                         303.87 ms │  1.06x slower │
│ QQuery 29 │   444.97 ms │                         445.72 ms │     no change │
│ QQuery 30 │    75.39 ms │                          62.77 ms │ +1.20x faster │
│ QQuery 31 │   312.87 ms │                         320.73 ms │     no change │
│ QQuery 32 │    84.70 ms │                          85.83 ms │     no change │
│ QQuery 33 │   209.39 ms │                         209.49 ms │     no change │
│ QQuery 34 │   163.14 ms │                         149.48 ms │ +1.09x faster │
│ QQuery 35 │   180.20 ms │                         169.66 ms │ +1.06x faster │
│ QQuery 36 │   289.05 ms │                         303.29 ms │     no change │
│ QQuery 37 │   253.74 ms │                         276.29 ms │  1.09x slower │
│ QQuery 38 │   159.54 ms │                         139.11 ms │ +1.15x faster │
│ QQuery 39 │   212.38 ms │                         197.46 ms │ +1.08x faster │
│ QQuery 40 │   170.56 ms │                         162.39 ms │     no change │
│ QQuery 41 │    23.56 ms │                          21.79 ms │ +1.08x faster │
│ QQuery 42 │   145.50 ms │                         148.16 ms │     no change │
│ QQuery 43 │   128.18 ms │                         118.50 ms │ +1.08x faster │
│ QQuery 44 │    29.16 ms │                          28.17 ms │     no change │
│ QQuery 45 │    89.26 ms │                          84.75 ms │ +1.05x faster │
│ QQuery 46 │   323.26 ms │                         310.91 ms │     no change │
│ QQuery 47 │  1063.80 ms │                         611.34 ms │ +1.74x faster │
│ QQuery 48 │   412.79 ms │                         377.71 ms │ +1.09x faster │
│ QQuery 49 │   362.32 ms │                         347.63 ms │     no change │
│ QQuery 50 │   342.17 ms │                         336.72 ms │     no change │
│ QQuery 51 │   304.31 ms │                         294.95 ms │     no change │
│ QQuery 52 │   146.14 ms │                         144.14 ms │     no change │
│ QQuery 53 │   151.03 ms │                         151.82 ms │     no change │
│ QQuery 54 │   225.87 ms │                         221.13 ms │     no change │
│ QQuery 55 │   146.84 ms │                         145.32 ms │     no change │
│ QQuery 56 │   209.34 ms │                         208.06 ms │     no change │
│ QQuery 57 │   299.26 ms │                         241.25 ms │ +1.24x faster │
│ QQuery 58 │   470.84 ms │                         381.81 ms │ +1.23x faster │
│ QQuery 59 │   293.29 ms │                         250.60 ms │ +1.17x faster │
│ QQuery 60 │   216.46 ms │                         212.93 ms │     no change │
│ QQuery 61 │   244.54 ms │                         255.30 ms │     no change │
│ QQuery 62 │  1274.77 ms │                        1275.72 ms │     no change │
│ QQuery 63 │   151.99 ms │                         154.21 ms │     no change │
│ QQuery 64 │  1149.58 ms │                        1153.85 ms │     no change │
│ QQuery 65 │   351.55 ms │                         326.36 ms │ +1.08x faster │
│ QQuery 66 │   403.22 ms │                         417.72 ms │     no change │
│ QQuery 67 │   559.38 ms │                         537.67 ms │     no change │
│ QQuery 68 │   372.75 ms │                         371.33 ms │     no change │
│ QQuery 69 │   175.32 ms │                         163.03 ms │ +1.08x faster │
│ QQuery 70 │   505.85 ms │                         418.04 ms │ +1.21x faster │
│ QQuery 71 │   188.24 ms │                         186.10 ms │     no change │
│ QQuery 72 │  2056.33 ms │                        2028.52 ms │     no change │
│ QQuery 73 │   158.95 ms │                         149.71 ms │ +1.06x faster │
│ QQuery 74 │   840.78 ms │                         479.78 ms │ +1.75x faster │
│ QQuery 75 │   402.97 ms │                         403.47 ms │     no change │
│ QQuery 76 │   189.46 ms │                         194.31 ms │     no change │
│ QQuery 77 │   287.12 ms │                         300.86 ms │     no change │
│ QQuery 78 │   939.94 ms │                         692.03 ms │ +1.36x faster │
│ QQuery 79 │   327.22 ms │                         314.23 ms │     no change │
│ QQuery 80 │   506.49 ms │                         490.73 ms │     no change │
│ QQuery 81 │    52.42 ms │                          52.08 ms │     no change │
│ QQuery 82 │   288.81 ms │                         286.53 ms │     no change │
│ QQuery 83 │    81.20 ms │                          72.30 ms │ +1.12x faster │
│ QQuery 84 │    69.93 ms │                          65.07 ms │ +1.07x faster │
│ QQuery 85 │   222.94 ms │                         174.64 ms │ +1.28x faster │
│ QQuery 86 │    60.20 ms │                          58.43 ms │     no change │
│ QQuery 87 │   155.69 ms │                         139.57 ms │ +1.12x faster │
│ QQuery 88 │   276.47 ms │                         274.15 ms │     no change │
│ QQuery 89 │   173.29 ms │                         159.86 ms │ +1.08x faster │
│ QQuery 90 │    46.31 ms │                          46.87 ms │     no change │
│ QQuery 91 │    96.05 ms │                          70.57 ms │ +1.36x faster │
│ QQuery 92 │    85.03 ms │                          84.30 ms │     no change │
│ QQuery 93 │   264.55 ms │                         245.48 ms │ +1.08x faster │
│ QQuery 94 │    93.23 ms │                          89.59 ms │     no change │
│ QQuery 95 │   242.40 ms │                         237.81 ms │     no change │
│ QQuery 96 │   116.21 ms │                         111.83 ms │     no change │
│ QQuery 97 │   190.30 ms │                         171.59 ms │ +1.11x faster │
│ QQuery 98 │   219.09 ms │                         174.78 ms │ +1.25x faster │
│ QQuery 99 │ 14262.08 ms │                       14135.21 ms │     no change │
└───────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 50517.49ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 47295.97ms │
│ Average Time (HEAD)                              │   510.28ms │
│ Average Time (hash-join-buffering-on-probe-side) │   477.74ms │
│ Queries Faster                                   │         37 │
│ Queries Slower                                   │          5 │
│ Queries with No Change                           │         57 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (139cf50) to ca904b3 diff using: tpch
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 175.65 ms │                         179.02 ms │     no change │
│ QQuery 2  │  89.87 ms │                          78.50 ms │ +1.14x faster │
│ QQuery 3  │ 125.34 ms │                         122.27 ms │     no change │
│ QQuery 4  │  79.15 ms │                          74.96 ms │ +1.06x faster │
│ QQuery 5  │ 175.79 ms │                         181.39 ms │     no change │
│ QQuery 6  │  68.11 ms │                          62.60 ms │ +1.09x faster │
│ QQuery 7  │ 212.98 ms │                         210.80 ms │     no change │
│ QQuery 8  │ 163.59 ms │                         163.50 ms │     no change │
│ QQuery 9  │ 232.96 ms │                         229.13 ms │     no change │
│ QQuery 10 │ 185.74 ms │                         193.67 ms │     no change │
│ QQuery 11 │  63.32 ms │                          59.65 ms │ +1.06x faster │
│ QQuery 12 │ 120.67 ms │                         119.02 ms │     no change │
│ QQuery 13 │ 219.39 ms │                         206.11 ms │ +1.06x faster │
│ QQuery 14 │  89.87 ms │                          91.05 ms │     no change │
│ QQuery 15 │ 124.56 ms │                         101.46 ms │ +1.23x faster │
│ QQuery 16 │  60.16 ms │                          55.35 ms │ +1.09x faster │
│ QQuery 17 │ 271.62 ms │                         258.37 ms │     no change │
│ QQuery 18 │ 313.52 ms │                         306.53 ms │     no change │
│ QQuery 19 │ 135.88 ms │                         136.65 ms │     no change │
│ QQuery 20 │ 133.43 ms │                         124.26 ms │ +1.07x faster │
│ QQuery 21 │ 263.86 ms │                         244.11 ms │ +1.08x faster │
│ QQuery 22 │  42.09 ms │                          39.17 ms │ +1.07x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 3347.55ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 3237.59ms │
│ Average Time (HEAD)                              │  152.16ms │
│ Average Time (hash-join-buffering-on-probe-side) │  147.16ms │
│ Queries Faster                                   │        10 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │        12 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

@apache apache deleted a comment from alamb-ghbot Jan 19, 2026
@apache apache deleted a comment from alamb-ghbot Jan 19, 2026
@gabotechs
Copy link
Contributor Author

I think for this to be optimal further complex logic will need to be introduced so that buffering can properly interact with dynamic filtering.

However, even if we decide to not buffer eagerly in case there's a dynamic filter below a BufferExec, it does show some small speedups even in benchmarks that pull data from local files, which means that speedups in more IO bounded environments (files in S3) will be even more noticeable.

I want to propose moving forward with this, but with buffering disabled by default. That way, users can decide to opt into it until we find a smarter rule of when to apply buffering automatically based on the query shape and where are dynamic filters placed.

The steps I want to propose are:

  • Do not require mut in memory reservation methods #19759: review and ship this one first in isolation
  • Add BufferExec execution plan #19760: just add the BufferExec node without any wireup to the rest of DataFusion
  • Remove all the ExecutionPlan modifications from this PR and just move forward with the physical optimizer rule, leaving it disabled by default
  • Do some follow ups in order to ship any smarter logic that opens the door to apply buffering automatically

@Dandandan what do you think?

github-merge-queue bot pushed a commit that referenced this pull request Jan 26, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Prerequisite for the following PRs:
- #19760
- #19761

Even if the api on the `MemoryPool` does not require `&mut self` for
growing/shrinking the reserved size, the api in `MemoryReservation`
does, making simple implementations irrepresentable without
synchronization primitives. For example, the following would require a
`Mutex` for concurrent access to the `MemoryReservation` in different
threads, even though the `MemoryPool` doesn't:

```rust
let mut stream: SendableRecordBatchStream = SendableRecordBatchStream::new();
let mem: Arc<MemoryReservation> = Arc::new(MemoryReservation::new_empty());

let mut builder = ReceiverStreamBuilder::new(10);
let tx = builder.tx();
{
    let mem = mem.clone();
    builder.spawn(async move {
        while let Some(msg) = stream.next().await {
            mem.try_grow(msg.unwrap().get_array_memory_size()); // ❌ `mem` is not mutable
            tx.send(msg).unwrap();
        }
    });
}
builder
    .build()
    .inspect_ok(|msg| mem.shrink(msg.get_array_memory_size()));  // ❌ `mem` is not mutable
```


## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Make the methods in `MemoryReservation` require `&self` instead of `&mut
self` for allowing concurrent shrink/grows from different tasks for the
same reservation.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

yes, by current tests

## Are there any user-facing changes?

Users can now safely call methods of `MemoryReservation` from different
tasks without synchronization primitives.

This is a backwards compatible API change, as it will work out of the
box for current users, however, depending on their clippy configuration,
they might see some new warnings about "unused muts" in their codebase.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
@gabotechs gabotechs force-pushed the hash-join-buffering-on-probe-side branch from 139cf50 to 3bd48fc Compare February 1, 2026 11:10
@github-actions github-actions bot removed core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) execution Related to the execution crate datasource Changes to the datasource crate labels Feb 1, 2026
@gabotechs gabotechs force-pushed the hash-join-buffering-on-probe-side branch 2 times, most recently from 41e24b7 to 5a4f11f Compare February 1, 2026 11:17
@gabotechs gabotechs force-pushed the hash-join-buffering-on-probe-side branch from 5a4f11f to d2551e7 Compare February 1, 2026 11:46
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 1, 2026
@gabotechs gabotechs marked this pull request as ready for review February 1, 2026 12:02
@gabotechs
Copy link
Contributor Author

gabotechs commented Feb 1, 2026

I want to propose moving forward with #19760 and this PR with the config parameter disabled by default, as it shows performance improvements too god to dismiss (see PR description for updated benchmarks results). cc @adriangb as it's tangentially related with dynamic filters and @Dandandan as you seem to have interest in improving join performance.

@adriangb
Copy link
Contributor

adriangb commented Feb 1, 2026

run benchmarks

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (d2551e7) to 51c0475 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃       Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0 │  2275.40 ms │                        2288.44 ms │    no change │
│ QQuery 1 │   871.57 ms │                         933.75 ms │ 1.07x slower │
│ QQuery 2 │  1767.74 ms │                        1755.42 ms │    no change │
│ QQuery 3 │  1058.02 ms │                        1041.76 ms │    no change │
│ QQuery 4 │  2201.53 ms │                        2204.96 ms │    no change │
│ QQuery 5 │ 28183.46 ms │                       28475.63 ms │    no change │
│ QQuery 6 │  4002.70 ms │                        3879.12 ms │    no change │
│ QQuery 7 │  2818.12 ms │                        2828.21 ms │    no change │
└──────────┴─────────────┴───────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 43178.53ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 43407.29ms │
│ Average Time (HEAD)                              │  5397.32ms │
│ Average Time (hash-join-buffering-on-probe-side) │  5425.91ms │
│ Queries Faster                                   │          0 │
│ Queries Slower                                   │          1 │
│ Queries with No Change                           │          7 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     2.55 ms │                           2.63 ms │     no change │
│ QQuery 1  │    53.08 ms │                          51.55 ms │     no change │
│ QQuery 2  │   137.55 ms │                         143.36 ms │     no change │
│ QQuery 3  │   151.61 ms │                         156.07 ms │     no change │
│ QQuery 4  │  1008.41 ms │                        1065.74 ms │  1.06x slower │
│ QQuery 5  │  1248.57 ms │                        1298.33 ms │     no change │
│ QQuery 6  │     8.10 ms │                           8.02 ms │     no change │
│ QQuery 7  │    54.49 ms │                          55.11 ms │     no change │
│ QQuery 8  │  1389.19 ms │                        1388.96 ms │     no change │
│ QQuery 9  │  1769.89 ms │                        1768.78 ms │     no change │
│ QQuery 10 │   341.96 ms │                         345.85 ms │     no change │
│ QQuery 11 │   398.56 ms │                         400.51 ms │     no change │
│ QQuery 12 │  1171.12 ms │                        1229.80 ms │  1.05x slower │
│ QQuery 13 │  1857.50 ms │                        1892.76 ms │     no change │
│ QQuery 14 │  1204.33 ms │                        1240.29 ms │     no change │
│ QQuery 15 │  1167.09 ms │                        1187.09 ms │     no change │
│ QQuery 16 │  2515.28 ms │                        2522.10 ms │     no change │
│ QQuery 17 │  2406.44 ms │                        2496.47 ms │     no change │
│ QQuery 18 │  5454.78 ms │                        4852.87 ms │ +1.12x faster │
│ QQuery 19 │   122.14 ms │                         126.47 ms │     no change │
│ QQuery 20 │  1904.12 ms │                        1921.05 ms │     no change │
│ QQuery 21 │  2149.49 ms │                        2190.52 ms │     no change │
│ QQuery 22 │  3710.57 ms │                        3749.03 ms │     no change │
│ QQuery 23 │ 23632.33 ms │                       12228.40 ms │ +1.93x faster │
│ QQuery 24 │   213.10 ms │                         214.37 ms │     no change │
│ QQuery 25 │   471.09 ms │                         465.74 ms │     no change │
│ QQuery 26 │   231.30 ms │                         221.43 ms │     no change │
│ QQuery 27 │  2715.85 ms │                        2663.95 ms │     no change │
│ QQuery 28 │ 22317.52 ms │                       23225.07 ms │     no change │
│ QQuery 29 │   970.22 ms │                         969.30 ms │     no change │
│ QQuery 30 │  1266.59 ms │                        1257.08 ms │     no change │
│ QQuery 31 │  1337.66 ms │                        1322.67 ms │     no change │
│ QQuery 32 │  4543.59 ms │                        4234.52 ms │ +1.07x faster │
│ QQuery 33 │  5409.06 ms │                        5206.12 ms │     no change │
│ QQuery 34 │  5956.17 ms │                        5543.61 ms │ +1.07x faster │
│ QQuery 35 │  1914.06 ms │                        1852.12 ms │     no change │
│ QQuery 36 │   194.63 ms │                         198.01 ms │     no change │
│ QQuery 37 │    77.47 ms │                          76.48 ms │     no change │
│ QQuery 38 │   120.15 ms │                         117.04 ms │     no change │
│ QQuery 39 │   350.31 ms │                         341.41 ms │     no change │
│ QQuery 40 │    46.48 ms │                          43.76 ms │ +1.06x faster │
│ QQuery 41 │    42.02 ms │                          38.36 ms │ +1.10x faster │
│ QQuery 42 │    34.70 ms │                          33.38 ms │     no change │
└───────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 102071.11ms │
│ Total Time (hash-join-buffering-on-probe-side)   │  90346.17ms │
│ Average Time (HEAD)                              │   2373.75ms │
│ Average Time (hash-join-buffering-on-probe-side) │   2101.07ms │
│ Queries Faster                                   │           6 │
│ Queries Slower                                   │           2 │
│ Queries with No Change                           │          35 │
│ Queries with Failure                             │           0 │
└──────────────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 100.47 ms │                         102.17 ms │     no change │
│ QQuery 2  │  33.22 ms │                          32.78 ms │     no change │
│ QQuery 3  │  40.33 ms │                          40.23 ms │     no change │
│ QQuery 4  │  31.52 ms │                          31.27 ms │     no change │
│ QQuery 5  │  90.88 ms │                          90.50 ms │     no change │
│ QQuery 6  │  20.92 ms │                          21.17 ms │     no change │
│ QQuery 7  │ 164.22 ms │                         163.13 ms │     no change │
│ QQuery 8  │  39.93 ms │                          43.71 ms │  1.09x slower │
│ QQuery 9  │ 103.68 ms │                         107.76 ms │     no change │
│ QQuery 10 │  65.07 ms │                          68.02 ms │     no change │
│ QQuery 11 │  18.37 ms │                          18.63 ms │     no change │
│ QQuery 12 │  51.44 ms │                          51.15 ms │     no change │
│ QQuery 13 │  47.36 ms │                          49.36 ms │     no change │
│ QQuery 14 │  15.18 ms │                          15.03 ms │     no change │
│ QQuery 15 │  30.47 ms │                          30.62 ms │     no change │
│ QQuery 16 │  28.16 ms │                          28.62 ms │     no change │
│ QQuery 17 │ 143.74 ms │                         146.47 ms │     no change │
│ QQuery 18 │ 293.44 ms │                         276.71 ms │ +1.06x faster │
│ QQuery 19 │  41.30 ms │                          41.59 ms │     no change │
│ QQuery 20 │  57.07 ms │                          55.84 ms │     no change │
│ QQuery 21 │ 192.92 ms │                         197.77 ms │     no change │
│ QQuery 22 │  22.79 ms │                          22.19 ms │     no change │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 1632.48ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 1634.71ms │
│ Average Time (HEAD)                              │   74.20ms │
│ Average Time (hash-join-buffering-on-probe-side) │   74.30ms │
│ Queries Faster                                   │         1 │
│ Queries Slower                                   │         1 │
│ Queries with No Change                           │        20 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate documentation Improvements or additions to documentation optimizer Optimizer rules physical-plan Changes to the physical-plan crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants