Performance fixes#53

Merged
jourdain merged 4 commits into Kitware:master from berkgeveci:performance-fixes
Apr 22, 2026

Conversation

@berkgeveci (Collaborator)

Four commits, all in src/e3sm_quickview/plugins/eam_projection.py. Fixes a longstanding issue where every time-slider tick re-projected the entire ~100M-point mesh (~14 s) because EAMExtract's unconditional Modified() on its no-crop pass-through was invalidating EAMProject's points cache.

The fix is three-part:

1. EAMExtract now only bumps the points MTime on the transition out of a trimmed state.
2. EAMProject re-keys its cache from MTime to (id(input_points), project, translate), making it robust to spurious upstream Modified().
3. The pedigree-indexed copy in add_cell_arrays replaces the 25M-element fancy numpy index with a cached slice plan that exploits the fact that the clip-induced pedigree permutation is 99.95% monotonic-by-1 in long runs.

Finally, the one-time projection itself (pyproj.Transformer.transform) is now fanned out across a ThreadPoolExecutor; pyproj releases the GIL, so it scales nearly linearly. The thread count defaults to cpu_count - 1 and can be overridden via QV_PROJECTION_THREADS.

Net: steady-state slider-tick cost on the ne1024pg2 grid drops from ~14,800 ms to ~283 ms (~52x); the initial pipeline build drops from ~17.5 s to ~5.7 s and the per-drag crop-change cost from ~14 s to ~3 s.

EAMProject was re-projecting all input points through pyproj on every
pipeline update, adding ~14 seconds per timestep on the ne1024 grid
(25M points). The input geometry doesn't actually change when only the
scalar variable or time index changes, so the reprojection was wasted
work.

Two related fixes:

- EAMExtract no longer bumps its shared points' MTime on every no-crop
  pass-through; it only invalidates when transitioning out of a trimmed
  state. The unconditional Modified() was defeating EAMProject's cache
  on every pipeline update.

- EAMProject now keys its cache on the identity of the input vtkPoints
  object (plus the projection/translate parameters) rather than MTime.
  This makes the cache immune to spurious upstream Modified() calls on
  the shared points — a cleaner guarantee than chasing every filter
  that might bump MTime.
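The identity-keyed cache can be sketched as follows. This is illustrative, not the actual plugin code: the class and method names are hypothetical, and the real filter caches projected vtkPoints, stubbed here as an arbitrary `compute` callable.

```python
class ProjectionCache:
    """Cache keyed on the identity of the shared input points object plus
    the projection parameters, rather than on MTime. A spurious upstream
    Modified() bumps MTime but does not change id(), so it no longer
    invalidates the cache."""

    def __init__(self):
        self._key = None
        self._result = None

    def get(self, input_points, project, translate, compute):
        key = (id(input_points), project, translate)
        if key != self._key:
            # Only recompute when the points object itself or a
            # projection parameter actually changed.
            self._result = compute(input_points)
            self._key = key
        return self._result
```

One caveat worth noting: id() values can be recycled after an object is garbage-collected, so this keying is only safe while the upstream filter keeps the shared vtkPoints alive, which holds in the pass-through scheme described above.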

End-to-end pipeline cost on a time-slider tick (ne1024pg2, 25M cells,
2 variables enabled) drops from ~14,800 ms to ~200 ms.

EAMCenterMeridian's cached output path rebuilds each timestep's cell data
by fancy-indexing the input scalars through a PedigreeIds permutation
produced by vtkTableBasedClipDataSet. The permutation is almost entirely
long runs of +1-stepped indices (~99.95% in our case, a handful of
breaks at the seam every ~2048 cells), so the fancy index was doing far
more work than necessary.

Replace with a slice plan: one pass over PedigreeIds to identify the
monotonic runs, then on each tick execute `len(runs)` slice copies
(`out[s:e] = in[pid[s]:pid[s]+e-s]`). The plan is cached on the
pedigree array's (id, MTime) so it only runs once per meridian change.

Also write directly into the output vtkDataArray's buffer instead of
going through dsa's __setitem__, which was doing a full numpy → VTK
wrap round-trip.
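A minimal sketch of the slice plan, assuming the pedigree ids arrive as a NumPy integer array. The run detection and per-run copies mirror the description above; the function names are hypothetical.

```python
import numpy as np

def build_slice_plan(pid):
    """One pass over the pedigree ids: find the maximal runs where the
    index steps by exactly +1, returning (out_start, out_end, in_start)
    triples. For a ~99.95% monotonic permutation this yields a short
    list of runs instead of a 25M-element fancy index."""
    breaks = np.flatnonzero(np.diff(pid) != 1) + 1
    starts = np.concatenate(([0], breaks))
    ends = np.concatenate((breaks, [len(pid)]))
    return [(int(s), int(e), int(pid[s])) for s, e in zip(starts, ends)]

def apply_slice_plan(plan, src, out):
    # Each run becomes one contiguous slice copy (a memcpy-style move)
    # in place of the scattered gather that `out[:] = src[pid]` does.
    for s, e, i in plan:
        out[s:e] = src[i:i + (e - s)]
    return out
```

The result is equivalent to `out[:] = src[pid]`, but the per-tick cost is proportional to `len(runs)` slice copies, and the plan itself only needs rebuilding when the pedigree array changes.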

End-to-end: EAMCenterMeridian's cached-path cost drops from ~40 ms to
~22 ms per tick with 2 variables enabled on the ne1024pg2 grid.

The Mollweide/Robinson projection ran pyproj.Transformer.transform as a
single-threaded call on ~100M points, taking ~14 seconds on the ne1024pg2
grid. That cost dominated the initial pipeline build and, worse, the
per-drag cost when the user changed the crop region (every crop change
invalidates EAMProject's cache because EAMExtract compacts points via
RemoveGhostCells).

pyproj releases the GIL inside transform(), so chunking the input and
farming the chunks out to a ThreadPoolExecutor scales nearly linearly:
in a standalone 100M-point bench we saw 1x / 2.0x / 3.9x / 7.4x for 1,
2, 4, 8 threads. The maximum difference between outputs at any two
thread counts is 0: the chunking produces bit-identical output.

Capped at 8 threads empirically (speedup flattens there on a 10-core
machine, and leaving a couple cores free keeps the UI responsive).
Falls back to single-threaded for small inputs below 1M points where
the pool setup overhead would dominate.
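The chunked fan-out might look like the sketch below, with a plain Python callable standing in for pyproj.Transformer.transform (which releases the GIL, so real scaling depends on pyproj); the 1M-point threshold and 8-thread default follow the numbers in the commit message, and the function name is hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_transform(transform, x, y, n_threads=8, min_parallel=1_000_000):
    """Split x/y into n_threads contiguous chunks and run `transform`
    on each chunk in a thread pool. Chunk boundaries do not affect the
    per-point math, so the output is bit-identical to a single call.
    Small inputs fall back to one thread, where pool setup overhead
    would dominate."""
    n = len(x)
    if n < min_parallel or n_threads <= 1:
        return transform(x, y)
    bounds = np.linspace(0, n, n_threads + 1).astype(int)
    xs = np.empty_like(x)
    ys = np.empty_like(y)

    def work(i):
        s, e = bounds[i], bounds[i + 1]
        # Each worker writes into a disjoint slice of the output arrays.
        xs[s:e], ys[s:e] = transform(x[s:e], y[s:e])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() drains the iterator so worker exceptions propagate.
        list(pool.map(work, range(n_threads)))
    return xs, ys
```

Because each worker writes a disjoint output slice, no locking is needed; the speedup then comes entirely from the underlying transform releasing the GIL while it crunches its chunk.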

Measured on ne1024pg2 (~100M points):
  initial full-pipeline update   17.5 s  ->  5.7 s
  crop-drag total cost           13   s  ->  3   s

The previous commit hard-capped EAMProject's ThreadPoolExecutor at 8
threads with a comment arguing that's where the speedup flattens on a
10-core laptop. On a bigger machine that leaves real perf on the table,
so take two steps:

- Default: max(1, cpu_count - 1), leaving one core for the UI/IO thread.
- Override: QV_PROJECTION_THREADS env var for HPC nodes or to pin a
  specific thread count (e.g. for testing).
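The two-step policy can be sketched as follows (the helper name is hypothetical; QV_PROJECTION_THREADS is the env var named in the commit):

```python
import os

def projection_thread_count():
    """Default: cpu_count - 1, leaving one core for the UI/IO thread.
    QV_PROJECTION_THREADS overrides it, e.g. to pin a count on an HPC
    node or for testing. Always returns at least 1."""
    override = os.environ.get("QV_PROJECTION_THREADS")
    if override:
        return max(1, int(override))
    # os.cpu_count() can return None; fall back conservatively.
    return max(1, (os.cpu_count() or 2) - 1)
```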
@berkgeveci berkgeveci requested a review from jourdain April 22, 2026 17:57
@jourdain jourdain merged commit f8d6068 into Kitware:master Apr 22, 2026
1 check passed