Add PersistentProgramCache (sqlite + filestream backends) by cpcloud · Pull Request #1912 · NVIDIA/cuda-python

cpcloud · 2026-04-14T22:00:23Z

Summary

Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so from cuda.core.utils import StridedMemoryView stays lightweight.
Add ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
Add make_program_cache_key() — blake2b(32) digest with backend-specific gates that mirror Program/Linker:
- Versions: cuda-core, NVRTC (c++), libNVVM lib+IR (nvvm), linker backend+version (ptx); driver only on the cuLink path.
- Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like code for non-NVVM and extra_sources for non-NVVM.
- NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
- PTX (Linker) options pass through per-field gates that match _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under driver linker.
- Failed env probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
Add SQLiteProgramCache — single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, wal_checkpoint(TRUNCATE) + VACUUM after evictions so the cap bounds real on-disk usage. __contains__ is read-only; __len__ validates and prunes corrupt rows. threading.RLock serialises connection use. Schema-mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; OperationalError (lock/busy) propagates without nuking the file (and closes the partial connection).
Add FileStreamProgramCache — multi-process via tmp + os.replace. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded (snapshot (ino, size, mtime_ns), refuse unlink on mismatch) so a concurrent writer's os.replace is preserved. Stale temp files swept on open; live temps count toward the size cap. Windows ERROR_SHARING_VIOLATION/ERROR_LOCK_VIOLATION on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionError and all POSIX failures propagate. __len__ also rejects stored_key/path mismatch.
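
As an illustration of the key derivation described for make_program_cache_key() above, here is a minimal sketch using only hashlib. It is not the PR's implementation: `probe_label`, `canonical_ptxas_options`, `sketch_cache_key`, the field labels, and the length-prefixed framing are all hypothetical, and the real function covers many more options and gates.

```python
import hashlib


def probe_label(name, probe):
    """Return 'name=<version>', or a stable failure label if the probe raises."""
    try:
        return f"{name}={probe()}"
    except Exception as exc:
        # Only the exception class name is mixed in, so the label is stable
        # across processes and calls but never equals a successful probe.
        return f"{name}_probe_failed={type(exc).__name__}"


def canonical_ptxas_options(value):
    """Assumed canonical form: None/''/[]/() -> (), str -> (str,), list -> tuple."""
    if value is None or value == "":
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


def sketch_cache_key(code, code_type, target_type, version_probes,
                     ptxas_options=None, extra_digest=None):
    """Hypothetical 32-byte blake2b key over code, types, versions and options."""
    h = hashlib.blake2b(digest_size=32)

    def update(label, payload):
        # Length-prefix every field so adjacent fields cannot run together.
        h.update(label.encode() + b"\0" + len(payload).to_bytes(8, "little") + payload)

    update("code", code.encode() if isinstance(code, str) else bytes(code))
    update("code_type", code_type.lower().encode())  # lower-cased before hashing
    update("target_type", target_type.encode())
    for name in sorted(version_probes):
        update("env", probe_label(name, version_probes[name]).encode())
    for opt in canonical_ptxas_options(ptxas_options):
        update("ptxas", opt.encode())
    if extra_digest is not None:
        update("extra", extra_digest)
    return h.digest()
```

In this sketch, `"-v"` and `["-v"]` produce the same key, while an environment whose version probe raises produces a different key from one whose probe succeeds.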

Program.compile(cache=...) integration is out of scope (tracked by #176/#179).

Test plan

177 cache tests — single-process CRUD; LRU/size-cap (logical and on-disk); corruption + __len__ pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError); lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity test that parses _program.pyx via tokenize + ast.literal_eval.
End-to-end: real CUDA C++ compile → store in cache → reopen → get_kernel on the deserialised ObjectCode, parametrized over both backends.
CI: clean across all platforms.
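
For readers who want a feel for what the SQLite tests exercise, the following is a rough sketch of a lock-serialised, LRU-evicting sqlite3 cache. `TinySQLiteCache`, its schema, and the counter-based recency column are assumptions made for illustration; they are not taken from the PR's SQLiteProgramCache.

```python
import sqlite3
import threading


class TinySQLiteCache:
    """Hypothetical stand-in for illustration, not the PR's SQLiteProgramCache."""

    def __init__(self, path, max_size_bytes=None):
        self._lock = threading.RLock()
        self._max = max_size_bytes
        # isolation_level=None selects autocommit; check_same_thread=False is
        # acceptable here only because every use goes through self._lock.
        self._conn = sqlite3.connect(path, isolation_level=None, check_same_thread=False)
        self._one("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "key BLOB PRIMARY KEY, value BLOB NOT NULL, used INTEGER NOT NULL)"
        )

    def _one(self, sql, params=()):
        # Fetch one row and close the cursor so no statement stays active.
        cur = self._conn.execute(sql, params)
        try:
            return cur.fetchone()
        finally:
            cur.close()

    def _tick(self):
        # Monotonic recency counter stored in the table itself.
        return self._one("SELECT COALESCE(MAX(used), 0) + 1 FROM entries")[0]

    def __setitem__(self, key, value):
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
                (key, value, self._tick()),
            )
            self._evict()

    def __getitem__(self, key):
        with self._lock:
            row = self._one("SELECT value FROM entries WHERE key = ?", (key,))
            if row is None:
                raise KeyError(key)
            # A successful read bumps recency.
            self._conn.execute("UPDATE entries SET used = ? WHERE key = ?", (self._tick(), key))
            return row[0]

    def __contains__(self, key):
        # Read-only: membership tests do not bump recency.
        with self._lock:
            return self._one("SELECT 1 FROM entries WHERE key = ?", (key,)) is not None

    def _evict(self):
        if self._max is None:
            return
        evicted = False
        while self._one("SELECT COALESCE(SUM(LENGTH(value)), 0) FROM entries")[0] > self._max:
            self._conn.execute(
                "DELETE FROM entries WHERE key = (SELECT key FROM entries ORDER BY used LIMIT 1)"
            )
            evicted = True
        if evicted:
            # Shrink the WAL and the main file so the cap reflects disk usage.
            self._one("PRAGMA wal_checkpoint(TRUNCATE)")
            self._conn.execute("VACUUM")

    def close(self):
        with self._lock:
            self._conn.close()
```

The sketch measures the cap as the sum of value lengths, which is a simplification; the PR describes bounding real on-disk usage.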

Closes #178

github-actions · 2026-04-14T22:32:37Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-1912/
https://nvidia.github.io/cuda-python/pr-preview/pr-1912/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-1912/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-1912/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

Convert cuda.core.utils to a package and add persistent, on-disk caches for compiled ObjectCode produced by Program.compile.

Public API (cuda.core.utils):

* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping with context manager and pickle-safety warning. Path-backed ObjectCode is rejected at write time (would store only the path).

* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode, autocommit) with LRU eviction against an optional size cap. A threading.RLock serialises connection use so one cache object is safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after evictions so the size cap bounds real on-disk usage. __contains__ is read-only -- it does not bump LRU. __len__ counts only entries that survive validation and prunes corrupt rows. Schema-version mismatch on open drops the tables and rebuilds; corrupt / non-SQLite files are detected and the cache reinitialises empty. Transient OperationalError ("database is locked") propagates without nuking the file (and closes the partial connection).

* FileStreamProgramCache -- directory of atomically-written entries (tmp + os.replace) safe across concurrent processes. On-disk filenames are blake2b(32) hashes of the key so arbitrary-length keys never overflow filesystem name limits. Reader pruning is stat-guarded: only delete a corrupt-looking file if its inode/size/mtime have not changed since the read, so a concurrent os.replace by a writer is preserved. clear() and _enforce_size_cap use the same stat guard. Stale temp files (older than 1 hour) are swept on open and during eviction; live temp files count toward the size cap. Windows ERROR_SHARING_VIOLATION (32) and ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionErrors and all POSIX failures propagate. __len__ matches __getitem__ semantics (rejects schema/key/value mismatch).

* make_program_cache_key -- stable 32-byte blake2b key over code, code_type, ProgramOptions, target_type, name expressions, cuda core/NVRTC versions, NVVM lib+IR version, linker backend+version for PTX inputs (driver version included only on the cuLink path). Backend-specific gates mirror Program/Linker:

  * code_type lower-cased to match Program_init.
  * code_type/target_type combination validated against Program's SUPPORTED_TARGETS matrix.
  * NVRTC side-effect options (create_pch, time, fdevice_time_trace) and external-content options (include_path, pre_include, pch, use_pch, pch_dir) require an extra_digest from the caller. The per-field set/unset predicate (_option_is_set) mirrors the compiler's emission gates; collections.abc.Sequence is the is_sequence check, matching _prepare_nvrtc_options_impl.
  * NVVM use_libdevice=True requires extra_digest because libdevice bitcode comes from the active toolkit. extra_sources is rejected for non-NVVM. Bytes-like ``code`` is rejected for non-NVVM (Program() requires str there).
  * PTX (Linker) input options are normalised through per-field gates that match _prepare_nvjitlink_options / _prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse to a sentinel under the driver linker (it ignores them). ptxas_options canonicalises across str/list/tuple/empty shapes. The driver linker's hard rejections (time, ptxas_options, split_compile) raise at key time.
  * name_expressions are gated on backend == "nvrtc"; PTX/NVVM ignore them, matching Program.compile.
  * Failed environment probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and across repeated calls within a process.

Lazy import: ``from cuda.core.utils import StridedMemoryView`` does NOT pull in the cache backends. The cache classes are exposed via module __getattr__.
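
The lazy exposure via module __getattr__ follows the PEP 562 pattern. The sketch below shows that pattern in isolation, under clearly hypothetical names: `make_lazy_package` and `_LAZY` are illustrative, and the stdlib `colorsys.hsv_to_rgb` stands in for a heavier cache backend. In a real package the `__getattr__` function would simply sit at module level in `__init__.py`; a `types.ModuleType` is used here only so the example runs on its own.

```python
import importlib
import types

# Hypothetical mapping: public name -> module that actually defines it.
_LAZY = {"hsv_to_rgb": "colorsys"}


def make_lazy_package(name):
    """Build a module object whose listed attributes are imported on first use."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # PEP 562: invoked only when normal lookup on the module fails.
        try:
            target = _LAZY[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(target), attr)
        # Cache on the module so later lookups bypass __getattr__ entirely.
        setattr(pkg, attr, value)
        return value

    pkg.__getattr__ = __getattr__
    return pkg
```

Until `pkg.hsv_to_rgb` is first accessed, nothing in `_LAZY` is imported on behalf of the package, which is the property the lazy-import subprocess test is described as checking.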
sqlite3 is imported lazily inside SQLiteProgramCache.__init__ so the package is usable on interpreters built without libsqlite3.

Tests: 177 cache tests covering:

* single-process CRUD
* LRU/size-cap (logical and on-disk, including stat-guarded race scenarios)
* corruption + __len__ pruning
* schema-mismatch table-DROP
* threaded SQLite
* cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup)
* Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError)
* lazy-import subprocess test
* an end-to-end test that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens the cache, and calls get_kernel on the deserialised copy
* a test that parses _program.pyx via tokenize + ast.literal_eval to assert the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's matrix

Public API is documented in cuda_core/docs/source/api.rst.
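
The stat-guarded prune that the cross-process FileStream tests target can be approximated as follows. This is a sketch under assumptions, not the PR's code: `entry_path`, `atomic_write`, `snapshot`, and `guarded_unlink` are hypothetical helpers, and the type tag prepended before hashing is an illustrative choice.

```python
import hashlib
import os
import tempfile


def entry_path(root, key):
    """Fixed-length filename derived from an arbitrary-length bytes or str key."""
    # Assumed detail: tag the key type so "abc" and b"abc" map to different files.
    raw = b"s:" + key.encode() if isinstance(key, str) else b"b:" + bytes(key)
    return os.path.join(root, hashlib.blake2b(raw, digest_size=32).hexdigest())


def atomic_write(root, key, payload):
    """Write to a temp file in the same directory, then os.replace into place."""
    fd, tmp = tempfile.mkstemp(dir=root, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, entry_path(root, key))
    except BaseException:
        # Do not leave the temp file behind if the write or rename failed.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def snapshot(path):
    """Identity of the file as seen at read time."""
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def guarded_unlink(path, seen):
    """Unlink only if the file still matches the snapshot taken at read time."""
    try:
        if snapshot(path) != seen:
            return False  # a writer replaced the entry; keep the new file
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False
```

The guard narrows the race window but does not close it: a replace can still land between the second stat and the unlink.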

cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026

cpcloud added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels Apr 14, 2026

cpcloud self-assigned this Apr 14, 2026

cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 Compare April 14, 2026 22:15

cpcloud force-pushed the persistent-program-cache-178 branch 22 times, most recently from 5887554 to 1b24442 Compare April 19, 2026 12:58

cpcloud force-pushed the persistent-program-cache-178 branch from f1ae40e to b27ed2c Compare April 19, 2026 13:28

cpcloud added 6 commits April 19, 2026 10:05

463af75 fixup! feat(core.utils): run subprocess from neutral cwd to avoid source-tree shadow

106bf74 fixup! test(core.utils): rewriter does one final uncontested write to avoid scheduler-race flake

0625eb8 fixup! feat(core.utils): retry os.replace on Windows ERROR_ACCESS_DENIED (winerror 5)

c378eb7 fixup! feat(core.utils): retry read on Windows sharing PermissionError; ensure threads exit before cache close

9b6da1b fixup! feat(core.utils): also retry read on Windows EACCES (errno 13)

812ce0f fixup! test(core.utils): suppress PermissionError on reader's corrupt-write (Windows sharing race)

cpcloud wants to merge 7 commits into NVIDIA:main from cpcloud:persistent-program-cache-178
