Performance fixes#53

Merged
jourdain merged 4 commits into Kitware:master from berkgeveci:performance-fixes
Apr 22, 2026

Conversation

@berkgeveci (Collaborator)

Four commits, all in src/e3sm_quickview/plugins/eam_projection.py. Fixes a longstanding issue where every time-slider tick re-projected the entire ~100M-point mesh (~14 s) because EAMExtract's unconditional Modified() on its no-crop pass-through was invalidating EAMProject's points cache.

The fix is three-part:

1. EAMExtract now only bumps the points MTime on the transition out of a trimmed state.
2. EAMProject re-keys its cache from MTime to (id(input_points), project, translate), making it robust to spurious upstream Modified().
3. The pedigree-indexed copy in add_cell_arrays replaces the 25M-element fancy numpy index with a cached slice plan that exploits the fact that the clip-induced pedigree permutation is 99.95% monotonic-by-1 in long runs.

Finally, the one-time projection itself (pyproj.Transformer.transform) is now fanned out across a ThreadPoolExecutor; pyproj releases the GIL, so it scales nearly linearly. The thread count defaults to cpu_count - 1 and can be overridden via QV_PROJECTION_THREADS.

Net: steady-state slider-tick cost on the ne1024pg2 grid drops from ~14,800 ms to ~283 ms (~52x); the initial pipeline build drops from ~17.5 s to ~5.7 s and the per-drag crop-change cost from ~14 s to ~3 s.

EAMProject was re-projecting all input points through pyproj on every
pipeline update, adding ~14 seconds per timestep on the ne1024 grid
(25M points). The input geometry doesn't actually change when only the
scalar variable or time index changes, so the reprojection was wasted
work.

Two related fixes:

- EAMExtract no longer bumps its shared points' MTime on every no-crop
  pass-through; it only invalidates when transitioning out of a trimmed
  state. The unconditional Modified() was defeating EAMProject's cache
  on every pipeline update.

- EAMProject now keys its cache on the identity of the input vtkPoints
  object (plus the projection/translate parameters) rather than MTime.
  This makes the cache immune to spurious upstream Modified() calls on
  the shared points — a cleaner guarantee than chasing every filter
  that might bump MTime.
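The identity-keyed cache can be sketched as follows. This is illustrative, not the actual plugin code: the class and method names are hypothetical, and the real filter caches projected vtkPoints, stubbed here as an arbitrary `compute` callable.

```python
class ProjectionCache:
    """Cache keyed on the identity of the shared input points object plus
    the projection parameters, rather than on MTime. A spurious upstream
    Modified() bumps MTime but does not change id(), so it no longer
    invalidates the cache."""

    def __init__(self):
        self._key = None
        self._result = None

    def get(self, input_points, project, translate, compute):
        key = (id(input_points), project, translate)
        if key != self._key:
            # Only recompute when the points object itself or a
            # projection parameter actually changed.
            self._result = compute(input_points)
            self._key = key
        return self._result
```

One caveat worth noting: id() values can be recycled after an object is garbage-collected, so this keying is only safe while the upstream filter keeps the shared vtkPoints alive, which holds in the pass-through scheme described above.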

End-to-end pipeline cost on a time-slider tick (ne1024pg2, 25M cells,
2 variables enabled) drops from ~14,800 ms to ~200 ms.

EAMCenterMeridian's cached output path rebuilds each timestep's cell data
by fancy-indexing the input scalars through a PedigreeIds permutation
produced by vtkTableBasedClipDataSet. The permutation is almost entirely
long runs of +1-stepped indices (~99.95% in our case, a handful of
breaks at the seam every ~2048 cells), so the fancy index was doing far
more work than necessary.

Replace with a slice plan: one pass over PedigreeIds to identify the
monotonic runs, then on each tick execute `len(runs)` slice copies
(`out[s:e] = in[pid[s]:pid[s]+e-s]`). The plan is cached on the
pedigree array's (id, MTime) so it only runs once per meridian change.

Also write directly into the output vtkDataArray's buffer instead of
going through dsa's __setitem__, which was doing a full numpy → VTK
wrap round-trip.
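A minimal sketch of the slice plan, assuming the pedigree ids arrive as a NumPy integer array. The run detection and per-run copies mirror the description above; the function names are hypothetical.

```python
import numpy as np

def build_slice_plan(pid):
    """One pass over the pedigree ids: find the maximal runs where the
    index steps by exactly +1, returning (out_start, out_end, in_start)
    triples. For a ~99.95% monotonic permutation this yields a short
    list of runs instead of a 25M-element fancy index."""
    breaks = np.flatnonzero(np.diff(pid) != 1) + 1
    starts = np.concatenate(([0], breaks))
    ends = np.concatenate((breaks, [len(pid)]))
    return [(int(s), int(e), int(pid[s])) for s, e in zip(starts, ends)]

def apply_slice_plan(plan, src, out):
    # Each run becomes one contiguous slice copy (a memcpy-style move)
    # in place of the scattered gather that `out[:] = src[pid]` does.
    for s, e, i in plan:
        out[s:e] = src[i:i + (e - s)]
    return out
```

The result is equivalent to `out[:] = src[pid]`, but the per-tick cost is proportional to `len(runs)` slice copies, and the plan itself only needs rebuilding when the pedigree array changes.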

End-to-end: EAMCenterMeridian's cached-path cost drops from ~40 ms to
~22 ms per tick with 2 variables enabled on the ne1024pg2 grid.

The Mollweide/Robinson projection ran pyproj.Transformer.transform as a
single-threaded call on ~100M points, taking ~14 seconds on the ne1024pg2
grid. That cost dominated the initial pipeline build and, worse, the
per-drag cost when the user changed the crop region (every crop change
invalidates EAMProject's cache because EAMExtract compacts points via
RemoveGhostCells).

pyproj releases the GIL inside transform(), so chunking the input and
farming the chunks out to a ThreadPoolExecutor scales nearly linearly:
in a standalone 100M-point bench we saw 1x / 2.0x / 3.9x / 7.4x for 1,
2, 4, 8 threads. The maximum difference between outputs at any two
thread counts is 0: the chunking produces bit-identical output.

Capped at 8 threads empirically (speedup flattens there on a 10-core
machine, and leaving a couple cores free keeps the UI responsive).
Falls back to single-threaded for small inputs below 1M points where
the pool setup overhead would dominate.
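The chunked fan-out might look like the sketch below, with a plain Python callable standing in for pyproj.Transformer.transform (which releases the GIL, so real scaling depends on pyproj); the 1M-point threshold and 8-thread default follow the numbers in the commit message, and the function name is hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_transform(transform, x, y, n_threads=8, min_parallel=1_000_000):
    """Split x/y into n_threads contiguous chunks and run `transform`
    on each chunk in a thread pool. Chunk boundaries do not affect the
    per-point math, so the output is bit-identical to a single call.
    Small inputs fall back to one thread, where pool setup overhead
    would dominate."""
    n = len(x)
    if n < min_parallel or n_threads <= 1:
        return transform(x, y)
    bounds = np.linspace(0, n, n_threads + 1).astype(int)
    xs = np.empty_like(x)
    ys = np.empty_like(y)

    def work(i):
        s, e = bounds[i], bounds[i + 1]
        # Each worker writes into a disjoint slice of the output arrays.
        xs[s:e], ys[s:e] = transform(x[s:e], y[s:e])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() drains the iterator so worker exceptions propagate.
        list(pool.map(work, range(n_threads)))
    return xs, ys
```

Because each worker writes a disjoint output slice, no locking is needed; the speedup then comes entirely from the underlying transform releasing the GIL while it crunches its chunk.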

Measured on ne1024pg2 (~100M points):
  initial full-pipeline update   17.5 s  ->  5.7 s
  crop-drag total cost           13   s  ->  3   s

The previous commit hard-capped EAMProject's ThreadPoolExecutor at 8
threads with a comment arguing that's where the speedup flattens on a
10-core laptop. On a bigger machine that leaves real perf on the table,
so take two steps:

- Default: max(1, cpu_count - 1), leaving one core for the UI/IO thread.
- Override: QV_PROJECTION_THREADS env var for HPC nodes or to pin a
  specific thread count (e.g. for testing).
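The two-step policy can be sketched as follows (the helper name is hypothetical; QV_PROJECTION_THREADS is the env var named in the commit):

```python
import os

def projection_thread_count():
    """Default: cpu_count - 1, leaving one core for the UI/IO thread.
    QV_PROJECTION_THREADS overrides it, e.g. to pin a count on an HPC
    node or for testing. Always returns at least 1."""
    override = os.environ.get("QV_PROJECTION_THREADS")
    if override:
        return max(1, int(override))
    # os.cpu_count() can return None; fall back conservatively.
    return max(1, (os.cpu_count() or 2) - 1)
```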
@berkgeveci berkgeveci requested a review from jourdain April 22, 2026 17:57
@jourdain jourdain merged commit f8d6068 into Kitware:master Apr 22, 2026
1 check passed