perf: direct-write stdout with unsynchronized CompactByteArrayOutputStream #680

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/direct-write-stdout
@He-Pin
Contributor

@He-Pin He-Pin commented Apr 5, 2026

Motivation

When outputting to stdout (no --output-file), sjsonnet renders JSON into a StringWriter backed by StringBuffer, calls toString() to create the full string, then println() to encode and write it. For realistic2 (28.6 MB JSON output), this creates three sources of overhead:

  1. StringBuffer synchronization: Every write() call on StringBuffer is synchronized. On Scala Native, each synchronized block maps to a pthread mutex lock/unlock pair, so the thousands of write calls needed for a 28 MB output add measurable locking overhead.
  2. Redundant memory copy: StringWriter.toString() creates a full copy of the 28MB char array into a String. Combined with StringBuffer's 2x growth factor, peak memory reaches ~3x the output size.
  3. Encoding overhead: PrintStream.println(String) must encode the entire 28MB String from UTF-16 chars to UTF-8 bytes for stdout.
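
The three costs above can be seen in a minimal sketch of the StringWriter path. This is an illustration under assumptions, not sjsonnet's actual code: `LegacyStdoutSketch`, `emit`, and `capture` are hypothetical names, and the renderer is replaced by a plain sequence of string chunks.

```scala
import java.io.{ByteArrayOutputStream, PrintStream, StringWriter}

// Hypothetical stand-in for the pre-PR stdout path described above.
object LegacyStdoutSketch {
  def emit(chunks: Seq[String], out: PrintStream): Unit = {
    val sw = new StringWriter()       // backed by a StringBuffer
    chunks.foreach(c => sw.write(c))  // (1) each write appends to the synchronized StringBuffer
    val rendered = sw.toString        // (2) copies the buffered chars into a new String
    out.println(rendered)             // (3) encodes the whole String to bytes for the stream
  }

  // Runs emit against an in-memory PrintStream so the result can be inspected.
  def capture(chunks: Seq[String]): String = {
    val sink = new ByteArrayOutputStream()
    val ps = new PrintStream(sink, false, "UTF-8")
    emit(chunks, ps)
    ps.flush()
    sink.toString("UTF-8")
  }
}
```

Calling `LegacyStdoutSketch.capture(Seq("{", "\"a\": 1", "}"))` returns the joined text followed by the platform line separator, since `println` appends one.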

Key Design Decision

Inspired by Apache Pekko's unsynchronized buffer approach, we render directly to bytes and write them to stdout in a single bulk operation:

  • CompactByteArrayOutputStream: Private inner class, unsynchronized, 1.5x growth factor, writeTo() for zero-copy transfer
  • Error safety: On rendering error, the buffer is simply discarded — nothing reaches stdout
  • Fallback: When stdout is null (library/programmatic use) or --output-file is specified, the original StringWriter path is used

BEFORE (per write call; thousands of calls for 28 MB):
  Renderer → CharBuilder → StringWriter(StringBuffer.write() [SYNCHRONIZED])
  ... → StringWriter.toString() [28MB CHAR COPY]
  ... → println(string) [28MB UTF-16→UTF-8 ENCODE]

AFTER:
  Renderer → OutputStreamWriter → CompactByteArrayOutputStream.write() [NO SYNC]
  ... → CompactByteArrayOutputStream.writeTo(stdout) [SINGLE BULK WRITE]
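
As a rough illustration of the buffer described above, here is a minimal sketch of an unsynchronized, 1.5x-growth byte buffer with a `writeTo` method. It is not the PR's implementation: the name `SketchCompactBuffer`, the minimum capacity of 16, and the overflow handling are all assumptions made for this example.

```scala
import java.io.OutputStream
import java.util.Arrays

// Hypothetical sketch; the PR's CompactByteArrayOutputStream may differ in detail.
final class SketchCompactBuffer(initialCapacity: Int = 8192) extends OutputStream {
  private var buf = new Array[Byte](math.max(initialCapacity, 16))
  private var count = 0

  // Grow by 1.5x, or straight to minCapacity if that is larger. No locking anywhere.
  private def ensureCapacity(minCapacity: Int): Unit = {
    if (minCapacity < 0) throw new OutOfMemoryError("buffer size overflow")
    if (minCapacity > buf.length) {
      val grown = buf.length + (buf.length >> 1)
      // If `grown` overflows to a negative value, math.max falls back to minCapacity.
      buf = Arrays.copyOf(buf, math.max(grown, minCapacity))
    }
  }

  override def write(b: Int): Unit = {
    ensureCapacity(count + 1)
    buf(count) = b.toByte
    count += 1
  }

  override def write(src: Array[Byte], off: Int, len: Int): Unit = {
    if (off < 0 || len < 0 || len > src.length - off) throw new IndexOutOfBoundsException()
    ensureCapacity(count + len)
    System.arraycopy(src, off, buf, count, len)
    count += len
  }

  def size: Int = count
  def capacity: Int = buf.length

  // Hands the internal array to the target stream directly, without a toByteArray copy.
  def writeTo(out: OutputStream): Unit = out.write(buf, 0, count)
}
```

Because the class is not thread-safe, a sketch like this is only appropriate when a single thread owns the buffer, which matches the CLI rendering path described in this PR.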

Modification

sjsonnet/src-jvm-native/sjsonnet/SjsonnetMainBase.scala:

  • Added CompactByteArrayOutputStream private inner class (unsynchronized, 1.5x growth, writeTo)
  • Added stdout: PrintStream parameter to writeToFile, renderNormal, and mainConfigured
  • New code path when stdout != null and no output file: renders through OutputStreamWriter → CompactByteArrayOutputStream → writeTo(stdout) → flush()
  • On rendering error, buffer is discarded (atomicity preserved)
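
The path selection and error behavior listed above might look roughly like the following sketch. The names `StdoutPathSketch` and `emit`, the `render` callback, and the `hasOutputFile` flag are hypothetical; `java.io.ByteArrayOutputStream` stands in for the PR's unsynchronized buffer, and trailing-newline handling is left out.

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter, PrintStream, StringWriter, Writer}
import java.nio.charset.StandardCharsets

// Hypothetical sketch of the dispatch; not the code in SjsonnetMainBase.scala.
object StdoutPathSketch {
  // Returns None when the bytes were written to stdout directly,
  // or Some(text) when the StringWriter fallback was used.
  def emit(render: Writer => Unit, stdout: PrintStream, hasOutputFile: Boolean): Option[String] = {
    if (stdout != null && !hasOutputFile) {
      val buf = new ByteArrayOutputStream() // stand-in for the unsynchronized buffer
      val writer = new OutputStreamWriter(buf, StandardCharsets.UTF_8)
      render(writer)       // if this throws, nothing below runs and buf is dropped
      writer.flush()       // push any chars still held by the encoder into buf
      buf.writeTo(stdout)  // single bulk write of the finished bytes
      stdout.flush()
      None
    } else {
      val sw = new StringWriter()
      render(sw)
      Some(sw.toString)
    }
  }
}
```

Since stdout is only touched after rendering has finished, a renderer that fails midway leaves stdout empty in this sketch, which is the atomicity property the last bullet refers to.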

Benchmark Results

Environment: Apple M3 Max, macOS 15.4, Scala Native 0.5.8

Scala Native — hyperfine (warmup 3, runs 10)

| Benchmark | master (ms) | PR (ms) | Δ | vs jrsonnet |
|---|---|---|---|---|
| realistic2 (28.6MB) | 258.7 ± 2.4 | 226.8 ± 9.0 | −12.3% | 2.21x (was 2.52x) |
| large_string_template | 23.6 ± 10.3 | 15.1 ± 1.7 | −36.0% | 1.89x (was 2.94x) |
| realistic1 | 15.2 ± 1.4 | 15.8 ± 1.9 | ~0% (noise) | 1.00x |
| gen_big_object | 10.9 ± 1.1 | 11.7 ± 2.4 | ~0% (noise) | 0.89x (we win) |
| large_string_join | 7.6 ± 1.6 | 7.9 ± 1.2 | ~0% (noise) | 1.16x |
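
For reference, the Δ column is consistent with a plain relative change of the mean times, (PR − master) / master. The small helper below is only a sketch for checking that arithmetic; `DeltaSketch` and its method names are hypothetical.

```scala
// Hypothetical helper for re-deriving the Δ column from the reported means.
object DeltaSketch {
  // Relative change in percent; negative means the PR is faster than master.
  def relativeChangePct(masterMs: Double, prMs: Double): Double =
    (prMs - masterMs) / masterMs * 100.0

  // Round to one decimal place, as in the table.
  def round1(x: Double): Double = math.round(x * 10.0) / 10.0
}
```

With the table's means, `DeltaSketch.round1(DeltaSketch.relativeChangePct(258.7, 226.8))` gives -12.3 and the same call for 23.6 and 15.1 gives -36.0. The calculation ignores the ± spread, which is large for large_string_template on master.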

JMH (JVM steady-state — I/O not measured, no change expected)

| Benchmark | ms/op |
|---|---|
| realistic2 | 57.1 |
| realistic1 | 1.8 |
| comparison | 21.8 |
| comparison2 | 40.1 |
| large_string_template | 1.8 |
| gen_big_object | 0.9 |

Analysis

  • 12.3% improvement on realistic2 (28.6MB output): the largest absolute saving of any benchmark (about 32 ms), from eliminating StringBuffer synchronization and the intermediate String copy
  • 36% improvement on large_string_template — string-heavy output benefits even more from direct byte encoding
  • No regression on small-output benchmarks (realistic1, gen_big_object, large_string_join)
  • JMH unaffected — JMH measures interpret() which returns a String, not the CLI output path
  • Error safety preserved: CompactByteArrayOutputStream buffers all output; only on success does writeTo(stdout) transfer the bytes

Result

Eliminates StringWriter/StringBuffer synchronization overhead and intermediate String copy for stdout output. Improves realistic2 by 12.3% and large_string_template by 36% on Scala Native. No functional change — only affects the CLI I/O path.

@He-Pin He-Pin marked this pull request as ready for review April 5, 2026 02:04
@He-Pin He-Pin marked this pull request as draft April 5, 2026 18:24
@He-Pin He-Pin force-pushed the perf/direct-write-stdout branch 5 times, most recently from b2862ad to e48a9f2 April 10, 2026 09:30
@He-Pin
Contributor Author

He-Pin commented Apr 10, 2026

I will rework the ByteArrayOutputStream as well; Pekko has an optimization for it that can be used here too.

@He-Pin He-Pin force-pushed the perf/direct-write-stdout branch from 5297f3e to 1f9acf0 April 10, 2026 16:55
@He-Pin He-Pin marked this pull request as ready for review April 10, 2026 17:18
@stephenamar-db
Collaborator

workflows are working again.

@He-Pin He-Pin force-pushed the perf/direct-write-stdout branch 9 times, most recently from 496819d to caa8239 April 11, 2026 00:08
@He-Pin He-Pin marked this pull request as draft April 11, 2026 02:32
…tream

Bypass StringWriter → toString → println overhead when writing to stdout
by buffering rendered output in CompactByteArrayOutputStream and writing
it directly via writeTo(stdout).

CompactByteArrayOutputStream is inspired by Apache Pekko's unsynchronized
buffer approach:
- No synchronization on write ops (avoids pthread mutex on Scala Native)
- 1.5x growth factor (vs 2x in BAOS) reduces memory waste
- writeTo() provides zero-copy transfer to stdout
- On error, buffer is simply discarded (atomicity)

When outputting to a file (-o), the original StringWriter path is used.

Benchmark (Scala Native, hyperfine --warmup 3 --runs 10):
  realistic2: 258.7ms → 226.8ms (-12.3%)
  large_string_template: 23.6ms → 15.1ms (-36.0%)
@He-Pin He-Pin force-pushed the perf/direct-write-stdout branch from caa8239 to 67c1157 April 11, 2026 02:40
@He-Pin He-Pin changed the title from "perf: direct-write stdout bypass StringWriter/StringBuffer allocation" to "perf: direct-write stdout with unsynchronized CompactByteArrayOutputStream" Apr 11, 2026
@He-Pin
Contributor Author

He-Pin commented Apr 11, 2026

Done! Updated to use a Pekko-inspired CompactByteArrayOutputStream:

  • Unsynchronized: No pthread mutex overhead on Scala Native (unlike java.io.ByteArrayOutputStream, whose write() methods are synchronized)
  • 1.5x growth factor: Reduced memory waste vs the standard 2x doubling
  • writeTo(OutputStream): Zero-copy transfer to stdout — no intermediate String creation

Results on Scala Native (hyperfine):

  • realistic2 (28.6MB): 258.7ms → 226.8ms (−12.3%)
  • large_string_template: 23.6ms → 15.1ms (−36%)
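
The growth-factor trade-off mentioned above can be explored with a small simulation. This is a sketch under assumptions, not a measurement of the PR: `GrowthSketch`, the 8192-byte chunk and initial sizes, and treating 28.6 MB as 28,600,000 bytes are all illustrative choices.

```scala
// Hypothetical simulation of buffer growth; it allocates nothing and only tracks sizes.
object GrowthSketch {
  final case class Outcome(finalCapacity: Long, reallocations: Int)

  // Appends `total` bytes in `chunk`-sized writes, growing with `grow` when full.
  def simulate(total: Long, chunk: Long, initial: Long)(grow: Long => Long): Outcome = {
    var cap = initial
    var size = 0L
    var reallocs = 0
    while (size < total) {
      val needed = math.min(size + chunk, total)
      if (needed > cap) {
        cap = math.max(grow(cap), needed) // same rule as a typical ensureCapacity
        reallocs += 1
      }
      size = needed
    }
    Outcome(cap, reallocs)
  }

  val oneAndHalf: Long => Long = c => c + (c >> 1)
  val doubling: Long => Long = c => c * 2
}
```

Running `GrowthSketch.simulate(28600000L, 8192L, 8192L)(GrowthSketch.oneAndHalf)` and the same call with `GrowthSketch.doubling` shows the general shape of the trade-off: 1.5x growth caps unused capacity at under half of the final size, where doubling can leave up to nearly the full size unused, at the cost of more reallocations. The slack for one particular output size depends on where that size falls in each growth sequence, so a single data point can favor either factor.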

@He-Pin He-Pin marked this pull request as ready for review April 11, 2026 02:45