Skip to content

Latest commit

 

History

History
200 lines (152 loc) · 10.4 KB

File metadata and controls

200 lines (152 loc) · 10.4 KB

ClojureWasm Optimizations

Completed optimizations and future opportunities, ordered by introduction.

Environment: Apple M4 Pro, 48 GB RAM, Darwin 25.2.0, ReleaseSafe Measurement: hyperfine (bench/history.yaml) Result: CW wins cold 14/19 vs Babashka, NaN boxing (D85) live, zwasm (D92) integrated


1. Completed Optimizations

Phase 3-23: Foundational

# Optimization Phase Key Impact
1 Arithmetic intrinsics 3 Direct opcodes for +,-,*,/,<,>,= etc.
2 BuiltinFn pointer dispatch 3 No var resolution for core builtins
3 Unified callFnVal dispatch (D36) 10 Single dispatch point for all fn types
4 VM heap allocation (D71) 22c Prevents C stack overflow in nested calls
5 Mark-Sweep GC (D69, D70) 23 Three-allocator architecture

Phase 24A: VM Core (Pre-24 → 24A.9)

# Task Optimization Biggest Improvement
6 24A.1 Switch dispatch + batched GC fib_loop 56→31ms (1.8x)
7 24A.2 Stack arg buffer (TreeWalk) Avoids GC allocation per call
8 24A.3 Fused reduce (lazy-seq collapse) sieve 2152→40ms (54x)
9 24A.4 Arithmetic fast-path inlining fib_recursive 502→41ms (12x)
10 24A.5 Monomorphic inline cache protocol_dispatch 30→27ms
11 24A.9 @branchHint annotations fib_recursive 41→28ms (1.5x)

Phase 24B: Data Structures

# Task Optimization Biggest Improvement
12 24B.2 HAMT persistent hash map map_ops 26→14ms (1.9x)
13 24B.4 GC tuning + meta tracing Fixed F97 sieve crash

Phase 24C: Babashka Parity (24A.9 → 24C.10)

# Task Optimization Biggest Improvement
14 24C.1 Fix fused reduce (__zig-lazy-map) lazy_chain 6655→17ms (391x)
15 24C.2 Multimethod 2-level cache multimethod 2053→14ms (147x)
16 24C.3 String stack buffer fast path string_ops 398→28ms (14x)
17 24C.4 Vector geometric COW + Cons cells vector_ops 180→14ms (13x)
18 24C.5 GC free-pool recycling gc_stress 324→46ms (7x)
19 24C.5b Two-phase bootstrap (D73) transduce 2134→15ms (142x)
20 24C.7 Filter chain collapsing (D74) sieve 1645→16ms (103x)
21 24C.9 Zig builtins for update-in etc. nested_update 39→23ms (1.7x)
22 24C.10 Collection constructor intrinsics gc_stress 55→35ms (1.6x)

Phase 35X: NaN Boxing (D85)

# Optimization Impact
23 4-heap-tag NaN boxing Value 48→8 bytes, 6x cache
28 heap types, 48-bit address VM stack 1.5MB→256KB

Phase 36.7: Wasm Interpreter (D86)

# Optimization Impact
24 VM reuse (stack cache) wasm_call 931→118ms (7.9x)
25 Sidetable (branch table) wasm_fib 11046→7663ms (1.44x)
cached_memory + @memset DECIDED-AGAINST (ROI < 1%)

Phase 37.2-37.3: VM Superinstructions + Branch Fusion

# Optimization Impact
26 Superinstructions (10 fused ops) arith_loop 53→40ms (1.33x)
27 Compare-and-branch fusion (7 ops) arith_loop 40→31ms (1.29x)
28 Recur-loop fusion Dispatch: 6→4 per loop iteration
Cumulative (37.1 base → 37.3) arith_loop 53→31ms (1.71x)

Phase 37.4: JIT PoC — ARM64 Hot Loop Native Code (D87)

# Optimization Impact
29 ARM64 JIT (hot integer loops) arith_loop 31→3ms (10.3x)
used_slots bitset (skip fn_val) Avoids deopt on closure self-ref
THEN path skip in analyzeLoop Handles real compiler bytecode
Cumulative (37.1 base → 37.4) arith_loop 53→3ms (17.7x)

2. Performance Summary

End-to-End Progression (Pre-24 → Post-zwasm)

Benchmark Pre-24 24C.10 Post-zwasm Speedup
fib_recursive 542 24 18 30x
map_filter_reduce 4,013 17 6 669x
sieve 2,152 16 6 359x
lazy_chain 21,375 16 9 2,375x
transduce 8,409 16 6 1,402x
multimethod_dispatch 2,373 15 6 396x
real_workload 1,286 22 9 143x

All times in ms (warm), ReleaseSafe, Apple M4 Pro. Post-zwasm = latest entry. Full table: bench/history.yaml.

Cross-Language Summary (Cold, Phase 24C.10)

CW vs Babashka (Cold): CW wins 14/19, BB wins 5/19, 1 skip. CW vs Ruby: CW wins 20/20. CW vs Java: CW wins 18/20. CW vs Python: CW wins 11/20.

Startup: C 3.9 / Zig 5.8 / BB 8.0 / Py 11.1 / CW 14.2 / Java 21.2 / Ruby 30.1 ms.

Re-run with bash bench/compare_langs.sh --both for latest numbers.


3. Future Optimizations

CW-side (can implement now)

ID Technique Expected Impact Effort Notes
F102 map/filter chunked processing lazy-seq alloc MEDIUM CW range is eager; deferred
F103 Escape analysis GC overhead HIGH Compiler detects local-only
F104 Profile-guided IC extension 2x polymorphic MEDIUM Beyond monomorphic IC
Generational GC 2-5x allocation HIGH Write barriers required
SmallVector (inline 2-3 elts) Small vec alloc MEDIUM NaN boxing extension
Closure stack allocation 1-2 capture MEDIUM Avoid heap for small closures

Wasm-side (in zwasm repository)

Technique Expected Impact Notes
Register-based IR 1.5-3x all Major rewrite of zwasm interpreter
ARM64 JIT for Wasm 5-20x Reuse CW JIT PoC (D87) patterns
Constant folding / DCE 5-10% Low ROI for current benchmarks

LOW PRIORITY

ID Technique Notes
F99 Iterative lazy-seq realize D74 partial fix, general case left
F4 HAMT persistent vectors Current ArrayMap sufficient
F120 Native SIMD (@Vector) Profile first before investing

DECIDED-AGAINST

Technique Reason
wasmtime-as-library +20MB binary, Rust dep. Keep zwasm (small, controlled)
Tail-call dispatch 0% improvement on Apple M4 (45.3 measured)
RRB-Tree vectors Vectors rarely sliced in practice
cached_memory (Wasm) ROI < 1% in benchmarks
SmallString widening asString() returns []const u8 — lifetime problem
String interning string_ops bottleneck is alloc, not comparison

COMPLETED (Phase 36.11+37+45)

Technique Phase Result
F101 into() transient 36.11 core.clj transient/persistent!
F105 JIT compilation (PoC) 37.4 ARM64 hot loops, arith_loop 17.7x
Predecoded IR (Wasm) 45.2 Fixed-width 8-byte instrs, 1.7-2.5x
Superinstructions (Wasm) 45.4 11 fused opcodes, fib 1.3x
Cached memory pointer (Wasm) 45.5 Marginal (~3% on sieve)

4. Wasm Performance (Post-zwasm v0.1.0, Register IR + ARM64 JIT)

Benchmark CW warm (ms) wasmtime (ms) Ratio vs Phase 45
fib(20)x10K 641 211 3.0x 6.8x
tak(18,12,6)x10K 2,786 1,174 2.4x 5.1x
arith(1M)x10 0.8 0.1 8.0x
sieve(64K)x100 21 4.8 4.4x 9.4x
fib_loop(25)x1M 22 2.2 10.0x 8.0x
gcd(1M,700K)x1M 54 41 1.3x 5.8x

CW startup (4.1ms) < wasmtime startup (5.5ms). zwasm Register IR + ARM64 JIT brings most benchmarks within 3-10x of wasmtime. gcd achieves near-parity (1.3x). Call-heavy workloads (fib, tak) at 2.4-3.0x. Full history: bench/wasm_history.yaml.


References

Topic Location
Benchmark history bench/history.yaml
Wasm bench history bench/wasm_history.yaml
Cross-language script bench/compare_langs.sh
Cross-language results bench/cross-lang-results.yaml
D85 NaN boxing .dev/decisions.md
D87 JIT PoC .dev/decisions.md
D92 zwasm integration .dev/decisions.md
Checklist items .dev/checklist.md
zwasm repository ../zwasm/ or GitHub