perf: direct JSON renderer bypassing Visitor pattern for JVM CLI output #735

Closed
He-Pin wants to merge 2 commits into databricks:master from He-Pin:perf/direct-stdout-render

He-Pin commented Apr 10, 2026

Motivation

The upickle Visitor/ObjVisitor/ArrVisitor pattern is the primary materialization bottleneck on the JVM. On realistic2, Visitor-based rendering dispatches ~3.3 million virtual method calls (per-element visitValue, visitKey, visitString, etc.). The JIT cannot fully devirtualize these calls because multiple Visitor implementations exist on the classpath.
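The difference between the two dispatch styles can be sketched as follows. This is a minimal illustration, not the PR's code: the `Json` model, the `Visitor` trait, and all method names here are hypothetical and much simpler than upickle's actual Visitor API or sjsonnet's `Val` hierarchy.

```scala
object DispatchSketch {
  // Hypothetical value model, far smaller than sjsonnet's real Val hierarchy.
  sealed trait Json
  final case class Num(value: Long) extends Json
  final case class Bool(value: Boolean) extends Json
  final case class Arr(items: Vector[Json]) extends Json

  // Simplified visitor interface (an assumption, not upickle's API): every
  // element costs at least one interface call that the JIT may fail to inline.
  trait Visitor {
    def visitNum(n: Long): Unit
    def visitBool(b: Boolean): Unit
    def visitArrStart(): Unit
    def visitArrSep(): Unit
    def visitArrEnd(): Unit
  }

  final class JsonVisitor(sb: StringBuilder) extends Visitor {
    def visitNum(n: Long): Unit = { sb.append(n); () }
    def visitBool(b: Boolean): Unit = { sb.append(b); () }
    def visitArrStart(): Unit = { sb.append('['); () }
    def visitArrSep(): Unit = { sb.append(','); () }
    def visitArrEnd(): Unit = { sb.append(']'); () }
  }

  // Visitor-style traversal: output goes through the interface.
  def walk(v: Json, visitor: Visitor): Unit = v match {
    case Num(n)  => visitor.visitNum(n)
    case Bool(b) => visitor.visitBool(b)
    case Arr(items) =>
      visitor.visitArrStart()
      var i = 0
      while (i < items.length) {
        if (i > 0) visitor.visitArrSep()
        walk(items(i), visitor)
        i += 1
      }
      visitor.visitArrEnd()
  }

  // Direct traversal: the same output, appended straight to the builder.
  def renderDirect(v: Json, sb: StringBuilder): Unit = v match {
    case Num(n)  => sb.append(n); ()
    case Bool(b) => sb.append(b); ()
    case Arr(items) =>
      sb.append('[')
      var i = 0
      while (i < items.length) {
        if (i > 0) sb.append(',')
        renderDirect(items(i), sb)
        i += 1
      }
      sb.append(']')
      ()
  }
}
```

Both functions produce the same text; the direct version simply has no interface between the traversal and the output buffer.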

Key Design Decisions

  1. JVM-only optimization: Platform.useDirectRenderer is a final val: true on JVM, false on Native. Native's LLVM LTO already devirtualizes the Visitor pattern at link time, making the direct renderer counterproductive there (measured: 6.8% regression → neutral after flag).

  2. Deep nesting safety: Falls back to the Materializer's hybrid recursive/iterative path (ArrayDeque-based materializeStackless) for subtrees beyond recursiveDepthLimit (128). This prevents stack overflow on deeply nested structures while keeping the fast path for normal depths.

  3. Materializable compatibility: Falls back to Visitor-based Renderer for custom Materializer.Materializable values, preserving the embedding API. The fallback Renderer overrides flushCharBuilder() with threshold 0 to prevent silent data loss (BaseCharRenderer uses threshold 1000 at depth ≥ 1).

  4. Output equivalence: Produces byte-identical JSON output to Renderer for all indent values (minified, indent=0, indent>0). Empty containers render as { } / [ ] matching Renderer behavior.
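The deep-nesting fallback in decision 2 can be sketched roughly like this. It assumes a toy `Json` model and renders minified arrays only; the names `render`, `renderIterative`, and `nest` are hypothetical, and the real fallback goes through the Materializer rather than a second renderer.

```scala
import scala.collection.mutable.ArrayDeque

object DepthLimitSketch {
  // Hypothetical value model; sjsonnet's real Val hierarchy is richer.
  sealed trait Json
  final case class Num(value: Long) extends Json
  final case class Arr(items: Vector[Json]) extends Json

  // Slow path: an explicit work stack replaces the call stack.
  // A work item is either literal text (Left) or a value to render (Right).
  def renderIterative(root: Json, sb: StringBuilder): Unit = {
    val work = ArrayDeque[Either[String, Json]](Right(root))
    while (work.nonEmpty) {
      work.removeLast() match {
        case Left(text)    => sb.append(text)
        case Right(Num(n)) => sb.append(n)
        case Right(Arr(items)) =>
          sb.append('[')
          work.append(Left("]"))
          // Push in reverse so the first item ends up on top of the stack.
          var i = items.length - 1
          while (i >= 0) {
            work.append(Right(items(i)))
            if (i > 0) work.append(Left(","))
            i -= 1
          }
      }
    }
  }

  // Fast path: plain recursion until the depth limit is reached.
  def render(v: Json, sb: StringBuilder, depth: Int, limit: Int): Unit = v match {
    case Num(n) => sb.append(n); ()
    case _: Arr if depth >= limit => renderIterative(v, sb)
    case Arr(items) =>
      sb.append('[')
      var i = 0
      while (i < items.length) {
        if (i > 0) sb.append(',')
        render(items(i), sb, depth + 1, limit)
        i += 1
      }
      sb.append(']')
      ()
  }

  def renderToString(v: Json, limit: Int = 128): String = {
    val sb = new StringBuilder
    render(v, sb, 0, limit)
    sb.toString
  }

  // Builds Num(1) wrapped in `depth` arrays, without recursion.
  def nest(depth: Int): Json = {
    var v: Json = Num(1)
    var i = 0
    while (i < depth) { v = Arr(Vector(v)); i += 1 }
    v
  }
}
```

With this shape, `renderToString(nest(100000))` recurses only 128 levels before handing the remaining subtree to the stack-based loop, so the call stack stays bounded regardless of input depth.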

Modification

  • New file: DirectJsonRenderer.scala (~320 lines) — final class with StringBuilder-based JSON rendering, single-pass string escaping, pre-computed indent cache, valTag-based O(1) dispatch.
  • Interpreter.scala: Added interpretStringify/materializeStringify for the direct-to-string pipeline, extracted createMaterializer() to share between old and new paths.
  • SjsonnetMainBase.scala: Fast-path in renderNormal when Platform.useDirectRenderer && !yamlOut && !expectString && outputFile.isEmpty.
  • Platform.scala (JVM/Native): Added useDirectRenderer flag.
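As an illustration of what "single-pass string escaping" can mean, here is a minimal sketch that scans each string once and copies unescaped runs in bulk. It is an assumption about the general technique, not the code in DirectJsonRenderer.scala; `appendEscaped` and `escape` are hypothetical names, and non-ASCII characters are passed through unchanged.

```scala
object EscapeSketch {
  private val Hex = "0123456789abcdef"

  // Appends `s` as a JSON string literal, scanning it exactly once.
  def appendEscaped(sb: java.lang.StringBuilder, s: String): Unit = {
    sb.append('"')
    var start = 0 // start of the pending run that needs no escaping
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      val esc: String = c match {
        case '"'  => "\\\""
        case '\\' => "\\\\"
        case '\n' => "\\n"
        case '\r' => "\\r"
        case '\t' => "\\t"
        case '\b' => "\\b"
        case '\f' => "\\f"
        // Other control characters become a six-character hex escape.
        case _ if c < 0x20 => "\\" + "u00" + Hex.charAt(c >> 4) + Hex.charAt(c & 0xf)
        case _ => null // no escaping needed
      }
      if (esc != null) {
        sb.append(s, start, i) // flush the clean run in one bulk copy
        sb.append(esc)
        start = i + 1
      }
      i += 1
    }
    sb.append(s, start, s.length) // flush the trailing clean run
    sb.append('"')
    ()
  }

  def escape(s: String): String = {
    val sb = new java.lang.StringBuilder(s.length + 2)
    appendEscaped(sb, s)
    sb.toString
  }
}
```

Strings with nothing to escape, which are the common case in most JSON output, cost one scan plus a single bulk append.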

Benchmark Results

JMH (JVM, ms/op, lower is better)

| Benchmark | Master | DirectRenderer | Δ |
|---|---|---|---|
| realistic2 | 62.0 | 55.9 | -9.8% |
| realistic1 | 2.0 | 1.8 | -6.1% |
| large_string_template | 1.8 | 1.7 | -8.3% |
| bench.02 | 33.0 | 32.6 | -1.3% |
| gen_big_object | 0.93 | 0.91 | -2.2% |
| gen_big_string | 0.24 | 0.23 | noise |

Hyperfine (Scala Native, realistic2 > /dev/null)

| Binary | Mean (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet master | 251.8 ± 2.7 | 2.52x |
| sjsonnet DirectRenderer | 252.5 ± 4.1 | 2.52x |
| jrsonnet | 100.1 ± 2.5 | 1.00x |

Native is neutral by design (DirectRenderer disabled via Platform.useDirectRenderer = false).
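A rough sketch of how such a per-platform flag can gate the fast path is shown below. The object names and the `takesFastPath` helper are hypothetical, and `outputFile` is assumed to be an `Option[String]`; in the real build each platform's source set would supply its own `Platform` object, and the condition mirrors the one described for `renderNormal`.

```scala
object FastPathSketch {
  // Stand-ins for the per-platform Platform objects. A `final val` with a
  // literal right-hand side and no type annotation can be treated by the
  // compiler as a constant.
  object JvmPlatform { final val useDirectRenderer = true }
  object NativePlatform { final val useDirectRenderer = false }

  // The direct renderer applies only to plain JSON written to stdout.
  def takesFastPath(
      useDirectRenderer: Boolean,
      yamlOut: Boolean,
      expectString: Boolean,
      outputFile: Option[String]
  ): Boolean =
    useDirectRenderer && !yamlOut && !expectString && outputFile.isEmpty
}
```

On Native the first operand is always false, so the Visitor-based path is taken unconditionally, which is consistent with the neutral Hyperfine numbers above.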

Analysis

The optimization targets JVM-specific overhead:

  • Eliminated: ~3.3M virtual dispatch calls per realistic2 evaluation (ObjVisitor/ArrVisitor allocation + visitValue/visitString/visitEnd calls)
  • Eliminated: CharBuilder → Writer → StringBuffer intermediate pipeline and StringBuffer synchronization
  • Preserved: Native performance (LLVM LTO already handles devirtualization), deep nesting safety, custom Materializable support

Result

JVM materialization time dropped by roughly 6-10% on the output-heavy benchmarks (realistic1, realistic2, large_string_template). No benchmark regressed. Native performance is unchanged by design.

He-Pin closed this Apr 10, 2026
He-Pin reopened this Apr 10, 2026
He-Pin force-pushed the perf/direct-stdout-render branch 4 times, most recently from 25e7066 to e28e428, April 10, 2026 23:17
Add DirectJsonRenderer that produces JSON directly via StringBuilder,
eliminating the upickle Visitor/ObjVisitor/ArrVisitor overhead that
dominates materialization cost on the JVM (~3.3M virtual dispatch calls
on realistic2).

Key design decisions:
- JVM-only optimization via Platform.useDirectRenderer (final val).
  Native's LLVM LTO already devirtualizes the Visitor pattern, making
  the direct renderer counterproductive there.
- Falls back to Materializer+Renderer for:
  (a) Materializable custom Val types (preserving embedding API)
  (b) Subtrees beyond recursiveDepthLimit (128) to use the iterative
      ArrayDeque-based materializer, preventing stack overflow on
      deeply nested structures.
- Fallback Renderer overrides flushCharBuilder() with threshold 0 to
  prevent silent data loss: BaseCharRenderer uses threshold 1000 at
  depth >= 1, which would cause small outputs to stay in elemBuilder.
- Output is byte-identical to Renderer for all indent values.

JMH results (JVM, ms/op, lower is better):
  realistic2:            62.0 → 55.9 (-9.8%)
  realistic1:             2.0 →  1.8 (-6.1%)
  large_string_template:  1.8 →  1.7 (-8.3%)
  bench.02:              33.0 → 32.6 (-1.3%)

Hyperfine Native (realistic2 > /dev/null):
  master:     251.8ms (no regression after Platform flag)
  jrsonnet:   100.1ms (2.52x faster — gap unchanged on Native)
He-Pin force-pushed the perf/direct-stdout-render branch from e28e428 to f5ca848, April 10, 2026 23:26

He-Pin commented Apr 11, 2026

Superseded by #745 (byte-renderer pipeline with fused materializer) and #747 (fused materializer for stdout). Those PRs implement the same concept with better optimization.

He-Pin closed this Apr 11, 2026