bench(gradle): harden JMH suite against run-to-run variance#9

Open
not-matthias wants to merge 6 commits into main from cod-2519-harden-codspeed-jvm-against-regressions

@not-matthias not-matthias commented Apr 17, 2026

  1. forceGC = true in the Gradle jmh block — System.gc() between
    iterations so a GC pause can't land mid-measurement.
  2. -Xbatch JVM arg — synchronous JIT compilation so the C2 background
    thread can't steal cycles during measurement.
  3. Trim Param matrices across DP / Bit / Rle / Regex benchmarks to a
    single representative value each.
  4. SortBenchmark trim + allocation fix — drop 5 redundant sort variants,
    keep one size, replace Arrays.copyOf (fresh Integer[] per invocation)
    with System.arraycopy into a pre-allocated working buffer. This is the
    direct fix for the timSort[100] flakiness called out in the ticket.
  5. BacktrackingBenchmark trim + allocation hoist — reduce each Param to 2
    values and move Integer[] / ArrayList allocations out of the timed
    region via Setup.
  6. Tune Warmup(10) / Measurement(10) / Fork(1) on all 8 benchmark classes.
    Generous warmup reaches JIT steady state; generous measurement iterations
    give tight per-fork confidence intervals. Fork(1) leaves cross-JVM variance
    sampling to the 6-distribution CI matrix.

codspeed-hq Bot commented Apr 17, 2026

Merging this PR will degrade performance by 14.35%

⚡ 12 improved benchmarks
❌ 2 regressed benchmarks
✅ 28 untouched benchmarks
⏩ 92 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| generateCombinations[9] | 28.6 µs | 24 µs | +19.3% |
| fib[30] | 8.4 ms | 9.4 ms | -11.22% |
| permutations[7] | 554.3 µs | 451.8 µs | +22.7% |
| mergeSort[10000] | 2.9 ms | 2.2 ms | +35.3% |
| levenshteinDistance[saturday sunday] | 400 ns | 467 ns | -14.35% |
| dualPivotQuickSort[10000] | 4.9 ms | 3.3 ms | +48.21% |
| quickSort[10000] | 2.5 ms | 2.2 ms | +13.72% |
| generateCombinations[7] | 12.6 µs | 11 µs | +14.35% |
| fibonacciOptimized[30] | 31 ns | 27 ns | +14.81% |
| fibonacciBottomUp[30] | 2.7 µs | 2.2 µs | +23.65% |
| permutations[5] | 12.8 µs | 11 µs | +16.9% |
| multiPatternScan[24] | 6.4 ms | 5.3 ms | +19.29% |
| compileAndMatch[24] | 4 ms | 3.6 ms | +11.27% |
| timSort[10000] | 2.4 ms | 1.6 ms | +48.37% |

Comparing cod-2519-harden-codspeed-jvm-against-regressions (dacf718) with main (c7c19aa)

Footnotes

  1. 92 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Enabling forceGC calls System.gc() before each measurement iteration so
a concurrent collection cannot land inside the measurement window and
show up as a spurious regression. Addresses part of COD-2519 (unchanged
PRs regressing due to GC noise).
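
The idea behind forceGC can be pictured with a hand-rolled timing loop. This is a minimal sketch, not JMH's actual harness; `GcBetweenIterations` and `measure` are hypothetical names introduced for illustration. `System.gc()` is only a request to the JVM, so this reduces the chance of a collection landing inside the timed window rather than ruling it out.

```java
/** Hypothetical illustration only; JMH's real measurement loop is more involved. */
final class GcBetweenIterations {

    /**
     * Runs {@code work} for {@code iterations} timed iterations and returns
     * the elapsed nanoseconds of each one. A collection is requested before
     * every timed window, so garbage left over from the previous iteration
     * is less likely to be collected while the clock is running.
     */
    static long[] measure(Runnable work, int iterations) {
        long[] elapsedNanos = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            System.gc(); // a hint to the JVM, not a guarantee
            long start = System.nanoTime();
            work.run();
            elapsedNanos[i] = System.nanoTime() - start;
        }
        return elapsedNanos;
    }
}
```

The GC request sits outside the `nanoTime()` pair, so its own cost is not counted in the sample.
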
@not-matthias not-matthias force-pushed the cod-2519-harden-codspeed-jvm-against-regressions branch from c6d648e to 1d8ea65 Compare April 17, 2026 18:01
@not-matthias not-matthias requested a review from Copilot April 17, 2026 18:04

Copilot AI left a comment

Pull request overview

This PR hardens the example Gradle JMH benchmark suite to reduce run-to-run variance in CI by adjusting JMH configuration, trimming parameter matrices, and removing allocation noise from timed regions.

Changes:

  • Increase warmup/measurement iterations (10/10) and standardize on @Fork(1) across benchmarks.
  • Trim benchmark @Param matrices to fewer representative values to reduce CI wall-clock and reduce noise.
  • Remove per-invocation allocations from hot benchmark paths (notably SortBenchmark and parts of BacktrackingBenchmark) and add JMH JVM/GC stabilization knobs in Gradle.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| examples/example-gradle/build.gradle.kts | Enable forceGC and add -Xbatch to reduce GC/JIT jitter during measurement. |
| examples/example-gradle/src/jmh/java/com/thealgorithms/sorts/SortBenchmark.java | Reduce param sizes, drop redundant sort variants, and replace per-invocation Arrays.copyOf with System.arraycopy into a reusable buffer. |
| examples/example-gradle/src/jmh/java/bench/SleepBenchmark.java | Increase warmup/measurement iterations for stability. |
| examples/example-gradle/src/jmh/java/bench/RleBenchmark.java | Increase warmup/measurement iterations and reduce input-size params. |
| examples/example-gradle/src/jmh/java/bench/RegexBenchmark.java | Increase warmup/measurement iterations and reduce backtracking param matrix. |
| examples/example-gradle/src/jmh/java/bench/FibBenchmark.java | Increase warmup/measurement iterations. |
| examples/example-gradle/src/jmh/java/bench/DynamicProgrammingBenchmark.java | Increase warmup/measurement iterations and trim multiple DP benchmark params. |
| examples/example-gradle/src/jmh/java/bench/BitManipulationBenchmark.java | Increase warmup/measurement iterations and trim bit-value params. |
| examples/example-gradle/src/jmh/java/bench/BacktrackingBenchmark.java | Increase warmup/measurement iterations, trim params, and hoist allocations into @Setup. |

Comment thread examples/example-gradle/src/jmh/java/bench/FibBenchmark.java Outdated
Comment thread examples/example-gradle/src/jmh/java/bench/DynamicProgrammingBenchmark.java Outdated
Comment thread examples/example-gradle/src/jmh/java/bench/BitManipulationBenchmark.java Outdated
Comment thread examples/example-gradle/src/jmh/java/bench/BacktrackingBenchmark.java Outdated
Comment thread examples/example-gradle/src/jmh/java/bench/SleepBenchmark.java Outdated
Comment thread examples/example-gradle/src/jmh/java/bench/RleBenchmark.java Outdated
Comment thread examples/example-gradle/src/jmh/java/bench/RegexBenchmark.java Outdated
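-Xbatch disables background compilation, so the benchmark thread waits
for each JIT compile to finish instead of running alongside a compiler
thread mid-measurement. Fixed heap size, pre-touched pages, serial GC,
and disabled adaptive sizing eliminate heap resizes, page faults, and
concurrent GC jitter during measurement.
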
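
The commit message names the settings but only spells out -Xbatch. The sketch below assumes the standard HotSpot spellings for the rest (-XX:+UseSerialGC, -XX:+AlwaysPreTouch, -XX:-UseAdaptiveSizePolicy, and matching -Xms/-Xmx) and shows one way to confirm that a forked JVM actually received them. `JvmFlagCheck` is a hypothetical helper, not part of this PR.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sanity check for the JVM arguments a benchmark fork was started with. */
final class JvmFlagCheck {

    // Assumed flag spellings; the PR text only names -Xbatch explicitly.
    static final List<String> EXPECTED = List.of(
            "-Xbatch",
            "-XX:+UseSerialGC",
            "-XX:+AlwaysPreTouch",
            "-XX:-UseAdaptiveSizePolicy");

    /** Returns the expected flags that do not appear in {@code actual}, in order. */
    static List<String> missing(List<String> actual, List<String> expected) {
        List<String> absent = new ArrayList<>();
        for (String flag : expected) {
            if (!actual.contains(flag)) {
                absent.add(flag);
            }
        }
        return absent;
    }

    /**
     * True when -Xms and -Xmx are both present with the same textual size,
     * which pins the heap. The comparison is textual, so "1g" and "1024m"
     * are treated as different.
     */
    static boolean hasFixedHeap(List<String> actual) {
        String min = null;
        String max = null;
        for (String arg : actual) {
            if (arg.startsWith("-Xms")) min = arg.substring(4);
            if (arg.startsWith("-Xmx")) max = arg.substring(4);
        }
        return min != null && min.equals(max);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM that is running this class.
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("missing flags: " + missing(actual, EXPECTED));
        System.out.println("fixed heap:    " + hasFixedHeap(actual));
    }
}
```
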
The DP, bit-manipulation, RLE and regex benchmarks each enumerated 2-5
input sizes purely to show scaling. For CodSpeed regression detection we
only need one point on the curve per benchmark method — additional
values multiply CI wall-clock without adding signal. Part of COD-2519.
…tion

Two related changes:

- Keep only 4 representative sort algorithms (quickSort, mergeSort,
  timSort, dualPivotQuickSort). The five dropped variants (heap,
  insertion, selection, shell, introspective) sort the same Integer[]
  input through similar code paths and don't exercise distinct
  regressions. Also narrows the param matrix to a single size (10000).

- Replace copyData()'s Arrays.copyOf — which allocated a fresh Integer[]
  on every invocation — with a System.arraycopy into a pre-allocated
  working buffer. For small sizes the allocation and its GC pressure
  dominated the sort work itself and was the primary source of the
  timSort[100] flakiness called out in COD-2519.
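
The second change can be sketched in plain Java. This is an illustration under assumptions, not the code in SortBenchmark.java: the class name, field names, and seeded random input are hypothetical, and only the Arrays.copyOf versus System.arraycopy contrast comes from the commit message.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical stand-in for the benchmark state that feeds each sort invocation. */
final class SortInputBuffer {
    private final Integer[] source;  // unsorted input, never modified
    private final Integer[] working; // reused by every invocation

    SortInputBuffer(int size, long seed) {
        Random random = new Random(seed);
        source = new Integer[size];
        for (int i = 0; i < size; i++) {
            source[i] = random.nextInt();
        }
        working = new Integer[size]; // allocated once, before any timing
    }

    /** Old shape: a fresh Integer[] is allocated on every call. */
    Integer[] copyWithAllocation() {
        return Arrays.copyOf(source, source.length);
    }

    /** New shape: the same buffer is refilled in place, so no array is allocated. */
    Integer[] copyIntoBuffer() {
        System.arraycopy(source, 0, working, 0, source.length);
        return working;
    }
}
```

Both methods hand the sort an unsorted array with the same contents. The copy itself still runs inside the timed region in either shape; what disappears is the per-invocation array allocation and the garbage it leaves behind.
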
…setup

Reduce each Param to 2 representative values (was 3-5). The smallest
values in the original sets were well below the JMH harness noise floor.

Also pre-allocate the Integer[] / ArrayList inputs in Setup(Trial) for
generateCombinations, permutations, and generateSubsequences.
Previously each Benchmark method allocated these structures inline on
every invocation, creating GC pressure that showed up as run-to-run
variance. Part of COD-2519.
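
The hoisting pattern looks roughly like this. The sketch uses plain Java with comments marking where the JMH annotations would go; `HoistedInputs` and its summing methods are hypothetical stand-ins for the real benchmark methods, whose signatures are not shown in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical benchmark state: inputs are built once, then only read. */
final class HoistedInputs {
    private final int n;
    private Integer[] elements;
    private List<Integer> elementList;

    HoistedInputs(int n) {
        this.n = n;
    }

    // In a JMH class this method would carry @Setup(Level.Trial),
    // so it runs once per trial and outside the timed region.
    void setup() {
        elements = new Integer[n];
        elementList = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            elements[i] = i;
            elementList.add(i);
        }
    }

    // In a JMH class this would be a @Benchmark method.
    // It reads the prebuilt array and allocates nothing itself.
    long sumArray() {
        long sum = 0;
        for (Integer element : elements) {
            sum += element;
        }
        return sum;
    }

    // Same idea for the list-based input.
    long sumList() {
        long sum = 0;
        for (int element : elementList) {
            sum += element;
        }
        return sum;
    }
}
```

Sharing one input across invocations is only safe when the benchmarked code does not mutate it. If it does, the input has to be restored between invocations, as the sort benchmark does with its working buffer.
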

All 8 benchmark classes previously ran at Warmup(1), Measurement(3),
Fork(1) — too little of everything to produce stable JMH numbers. In
particular Fork(1) is the core of COD-2519: a single JVM launch per
benchmark can't separate real regressions from JIT or ASLR
luck-of-the-draw.

Settle on Warmup(10), Measurement(10), Fork(1) for a ~14 min total CI
budget across the ~40 combos. Generous warmup reaches C2 steady state;
generous measurement iterations give JMH tight confidence intervals on
the per-fork score. Fork(1) leaves cross-JVM variance sampling to the
6-distribution CI matrix, which already runs each benchmark in 6
independent JVMs.
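
The budget figure can be sanity-checked with rough arithmetic. The commit message does not state the per-iteration time, so the sketch below assumes 1 second per iteration and ignores JVM startup and forced-GC overhead. Under that assumption, 40 combos × (10 + 10) iterations × 1 fork × 1 s = 800 s, or about 13.3 minutes, which is in line with the quoted ~14 min. `CiBudget` is a hypothetical helper.

```java
/** Hypothetical back-of-the-envelope estimate of benchmark wall-clock time. */
final class CiBudget {

    /**
     * Estimated seconds spent in warmup plus measurement iterations.
     * Excludes JVM startup, setup methods, and GC forced between iterations.
     */
    static double estimateSeconds(int combos,
                                  int warmupIterations,
                                  int measurementIterations,
                                  int forks,
                                  double secondsPerIteration) {
        int iterationsPerFork = warmupIterations + measurementIterations;
        return (double) combos * iterationsPerFork * forks * secondsPerIteration;
    }

    public static void main(String[] args) {
        // secondsPerIteration = 1.0 is an assumption, not a value from the PR.
        double seconds = estimateSeconds(40, 10, 10, 1, 1.0);
        System.out.println(seconds + " seconds, " + (seconds / 60.0) + " minutes");
    }
}
```

The same formula shows why the fork count is the expensive knob: every additional fork multiplies the whole budget, while the CI matrix already supplies independent JVM launches.
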
@not-matthias not-matthias force-pushed the cod-2519-harden-codspeed-jvm-against-regressions branch from af4a59a to dacf718 Compare April 20, 2026 16:24

@GuillaumeLagrange GuillaumeLagrange left a comment

OLGTM but I have two remarks

  1. Do we have stats on the variance before and after the changes, to make sure we're not throwing things around randomly and hoping it improves stuff?
  2. The warmup and iterations required for our benchmarks to be less noisy should be somewhere in the docs as guidelines, because people ARE going to report unreliable results, as they should, if we encountered them ourselves
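
One way to answer the first remark is to run the suite several times on an unchanged commit, before and after this PR, and compare the coefficient of variation (sample standard deviation divided by mean) of each benchmark's score. A minimal sketch follows; `RunVariance` is a hypothetical helper and no measurements from this PR are implied.

```java
/** Hypothetical helper for comparing run-to-run spread of one benchmark's scores. */
final class RunVariance {

    static double mean(double[] scores) {
        double sum = 0.0;
        for (double score : scores) {
            sum += score;
        }
        return sum / scores.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two scores. */
    static double sampleStdDev(double[] scores) {
        if (scores.length < 2) {
            throw new IllegalArgumentException("need at least two scores");
        }
        double average = mean(scores);
        double squaredDeviations = 0.0;
        for (double score : scores) {
            double deviation = score - average;
            squaredDeviations += deviation * deviation;
        }
        return Math.sqrt(squaredDeviations / (scores.length - 1));
    }

    /** Spread relative to the mean, so benchmarks on different time scales are comparable. */
    static double coefficientOfVariation(double[] scores) {
        return sampleStdDev(scores) / mean(scores);
    }
}
```

A lower coefficient of variation after the change, over the same number of repeated runs, would be direct evidence that the hardening reduced noise.
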
