
import/ga: harden staged import pipeline, GA backfill, and Heroku ops flows#886

Merged
northdpole merged 63 commits into main from cre-import-maturity-spyros-only
Apr 20, 2026
Conversation

@northdpole
Collaborator

Summary

  • Refactors the import workflow into a stronger staged/apply model with snapshotting, change-set persistence, conflict handling, telemetry, and expanded test coverage.
  • Hardens gap-analysis behavior end to end: broader GA eligibility (including taxonomy/risk-list standards), safer map_analysis handling, queue dedupe, and a dedicated missing-pair backfill flow replacing fragile preload polling.
  • Improves operational tooling for local/staging/prod parity: Postgres-first import/backfill scripts, safer Heroku setup/teardown, and generalized DB sync script usage for staging or production.

Why

  • Existing import and GA flows could stall or become noisy under long runs, especially around queueing, fallback behavior, and environment differences.
  • Operational tasks (bootstrap, sync, teardown, cache population) needed deterministic, repeatable scripts with fewer hidden assumptions.
  • We needed a reliable path to compute and ship missing GA cache entries (notably for AI Exchange-related standards) without HTTP-loop fragility.

Scope Highlights

Import Pipeline & Data Model

  • Adds/expands import-run + staged change-set workflow and apply path.
  • Persists richer import metadata and diff artifacts.
  • Refactors pipeline internals (import_diff, import_apply, import_pipeline, import_post_apply, telemetry/impact helpers) for clearer stage/apply boundaries.
  • Improves parser normalization and CRE link derivation consistency.

GA Eligibility, Scheduling, and API Behavior

  • Broadens GA eligibility to include taxonomy/risk-list standards (e.g. CAPEC-class resources).
  • Updates GA dropdown/source filtering to only show GA-eligible standards.
  • Hardens /rest/v1/map_analysis behavior:
    • cache-first safety,
    • in-flight dedupe protections,
    • safer queue/fallback flow,
    • Heroku cache-miss safety path to avoid production 500s when backend dependencies are absent.
  • Replaces old preload-oriented GA strategy with CLI/scripted missing-pair backfill flow.
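The hardened `/rest/v1/map_analysis` flow above can be sketched as a cache-first decision chain. This is a minimal illustration, not the real endpoint's signature: the parameter names, the `(source, target)` cache key, and the status payloads are assumptions for the sketch.

```python
def map_analysis(source, target, cache, schedule, on_heroku=False):
    """Cache-first sketch: serve cached results; on Heroku cache misses
    return 404 instead of computing (avoiding 500s when backend deps are
    absent); otherwise schedule the GA job and report it as pending."""
    cached = cache.get((source, target))
    if cached is not None:
        return 200, cached  # cache hit: no queueing, no backend work
    if on_heroku:
        # cache-only safety path: never try to compute in-process on Heroku
        return 404, {"error": "gap analysis not precomputed for this pair"}
    schedule((source, target))  # fall back to queueing the job
    return 202, {"status": "pending"}
```

The key ordering is that the cache lookup always happens first, so the queue and fallback paths are only reachable on a genuine miss.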

Scripts & Operational Maturity

  • Introduces/updates scripts for:
    • GA missing-pair backfill,
    • import-all Postgres workflows,
    • embeddings/table sync and scoped sync paths,
    • generalized local->Heroku Postgres push (app-agnostic naming),
    • staging lifecycle management including teardown mode from setup entrypoint.
  • Updates Makefile + README instructions to match the new operational flow.

DB/Migrations

  • Adds surrogate-key/PK hardening work for embeddings.
  • Makes embeddings migration path more idempotent/robust on Postgres variants.
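The idempotence pattern behind the migration hardening can be illustrated with a check-before-rebuild guard. This sketch uses SQLite so it is self-contained; the real migration targets Postgres and additionally handles legacy/mismatched constraint names, and the table shape here is hypothetical.

```python
import sqlite3

def ensure_surrogate_pk(conn: sqlite3.Connection, table: str) -> bool:
    """Only rebuild the table with a surrogate id PK if no primary key
    exists yet, so re-running the migration is a no-op (illustrative)."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    if any(row[5] for row in cols):  # row[5] is sqlite's pk flag
        return False  # already migrated: idempotent exit
    conn.execute(f"ALTER TABLE {table} RENAME TO {table}_old")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, doc TEXT)")
    conn.execute(f"INSERT INTO {table} (doc) SELECT doc FROM {table}_old")
    conn.execute(f"DROP TABLE {table}_old")
    return True
```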

Tests

  • Adds and updates tests across:
    • import staging/apply/diff/dashboard pathways,
    • GA gating and pair scheduling behavior,
    • admin import APIs,
    • parser and web behavior touched by the refactor.

Risks / Notes for Reviewers

  • Large cross-cutting PR touching import engine, GA orchestration, scripts, tests, and migration behavior.
  • Some features were rebased/cherry-picked from long-lived branch history; commit subjects may differ from earlier iterations while preserving intended content.
  • Recommended review order:
    1. DB/migration changes
    2. import pipeline/apply modules
    3. GA endpoint + eligibility changes
    4. scripts + docs
    5. tests

Test Plan

  • Run backend tests (make test) and verify import/GA suites pass.
  • Run lint/static checks used by CI.
  • Validate local import + GA backfill happy path:
    • make import-all
    • make backfill-gap-analysis
  • Validate map_analysis behavior for known cache-hit and cache-miss cases.
  • Validate staging setup/teardown script behavior including --delete.
  • Smoke-check AI Exchange + CAPEC GA cache entries on target env after sync.

Rollout / Ops

  • Deploy to staging first, run import/backfill flow, verify key GA pairs and API behavior.
  • Promote to production after parity checks pass.
  • Retire old operational commands/scripts per the README/Makefile updates.

Follow-up on the ISO spreadsheet mock so parity tests match parser output.
Add importer-local Family/Subtype/Audience/Maturity enums and a tag builder/validator in base_parser_defs.
Update all external parsers to emit family/subtype/source/audience/maturity tags, including training apps like Juice Shop.
Enforce tag presence via validate_classification_tags and extend parser tests to assert literal tag strings as the external convention.
…fixes

- Add normalize_embeddings_content for stable cache comparison
- Skip embedding generation when content unchanged (compare embeddings_content)
- Check nodes before CRE to avoid spurious 'CRE does not exist' logs
- Fix Playwright timeout: use TimeoutError from sync_api, not internal _api_types
- Skip malformed Neo4j nodes in gap analysis formatting instead of crashing
- Add commit retry with backoff for sqlite 'database is locked'
- Add _ga_eligible check; GA only for non-Tool/non-Code
- Add tests for GA skip (Tool) and GA run (taxonomy Standard)
- Add verify_checkpoint_2_incremental: delete every Nth Standard embedding + GA row, refill, verify
- Fail if deleted keys are not restored, counts mismatch, or untouched rows are mutated
- Exclude URL-backed embeddings from mutation check (upstream content can change)
- Assert refilled GA matches upstream exactly
- Add content-change and structure-change recalculation probes
- Rebuild Neo4j from sqlite before GA refill for deterministic verification
- Add checkpoint2_heroku_incremental_verify.sh for Heroku export + local verify
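The commit-retry item above follows a standard exponential-backoff shape. A minimal sketch, assuming an injectable commit callable and sleep function (the real code retries the SQLAlchemy session commit; these names are illustrative):

```python
import sqlite3
import time

def commit_with_retry(commit, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a commit on sqlite 'database is locked' with exponential
    backoff; re-raise any other error or the final locked failure."""
    for attempt in range(attempts):
        try:
            return commit()
        except sqlite3.OperationalError as exc:
            if "database is locked" not in str(exc) or attempt == attempts - 1:
                raise  # unrelated error, or retries exhausted
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```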
Build a name->ID map before object creation so CRE IDs resolve consistently across rows, and skip unresolved deferred CRE links with warnings instead of hard-failing parse.
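The two-pass resolution described above can be sketched as: build the name→ID map over all rows first, then resolve links against it, warning on (rather than failing on) unresolved deferrals. Row and field names here are hypothetical.

```python
import logging

def resolve_cre_links(rows):
    """Pass 1: map every named CRE to its ID. Pass 2: resolve links,
    skipping unresolved deferred links with a warning (illustrative)."""
    name_to_id = {r["name"]: r["id"] for r in rows if r.get("id")}
    resolved = []
    for r in rows:
        for link in r.get("links", []):
            target = name_to_id.get(link)
            if target is None:
                logging.warning("skipping unresolved CRE link %r -> %r",
                                r["name"], link)
                continue  # soft-fail instead of aborting the parse
            resolved.append((r["name"], target))
    return resolved
```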
Move standard-family parsing to dedicated subparsers and keep behavior aligned via an equivalence test against the previous parsing path.
Track import runs in the database, introduce reusable standards diff logic, and wire import-run creation into spreadsheet import flow with test coverage.
Avoid double-parsing mutated spreadsheet rows during checkpoint imports so spreadsheet-linked standards (ASVS/ISO/etc.) stay stable in golden diffs.
Make incremental embeddings sensitive to document metadata (including URL-fetched node content) and add tests to ensure metadata changes trigger re-embedding while unchanged metadata is a cache hit. Adds an import-path integration test to confirm re-registering identical standards does not call the embedding provider twice.
Add stable standard snapshot hashing to detect manual main-graph edits and flag conflicting change-set ops. Include forward-compatible changeset JSON parsing and checkpoint3 verifier + tests.
Extend `git.clone()` to support shallow clones with optional sparse checkout, and apply it to high-volume upstream repos (Cheat Sheets, Secure Headers, ZAP alerts, misc tools) to reduce IO and clone time.
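The shallow/sparse combination above corresponds to `git clone --depth 1` plus `--filter=blob:none --sparse` followed by `git sparse-checkout set`. A sketch that only builds the command lists (the project's actual `git.clone()` wrapper differs; function name and signature here are assumptions):

```python
def clone_commands(url, dest, sparse_paths=None):
    """Return the git invocations for a shallow (optionally sparse) clone,
    cutting IO and transfer for high-volume upstream repos."""
    cmds = []
    base = ["git", "clone", "--depth", "1"]
    if sparse_paths:
        # blobless filter + sparse init: fetch only what the checkout needs
        base += ["--filter=blob:none", "--sparse"]
    cmds.append(base + [url, dest])
    if sparse_paths:
        cmds.append(["git", "-C", dest, "sparse-checkout", "set", *sparse_paths])
    return cmds
```

These would typically be run via `subprocess.run(cmd, check=True)`.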
Ensure master spreadsheet CREs and per-family standards are enriched with required classification tags so the central import pipeline can validate and apply them.

Update spreadsheet and import-path tests to compare by stable ids and to mock RQ waiting.
Implement Phase 2 Step 3 admin endpoints to list import runs, fetch run details, and retrieve structured change sets. Gate endpoints behind login and an ADMIN_IMPORTS_ENABLED flag.
Includes tests and a deterministic Phase2 checkpoint5 verifier.
- StagedChangeSet.apply_error and update_staged_change_set helper
- apply_changeset(): dry-run, conflict guard, add/modify/remove Standard nodes
- Idempotent when staging_status is already applied
- scripts/checkpoint_phase3_apply_verify.py: dry-run, apply, idempotency, conflict
- unittest wrapper for CI
Treat worker shutdown as expected cleanup so import-all no longer reports false failures after successful imports.
Add staging lifecycle scripts that mirror production config/dyno shape, import sqlite through local postgres, configure domain+ACM, and safely teardown staging db/app resources.
Defer prompt_client import to chat/import handlers so web dynos avoid loading heavy ML dependencies at boot and reduce baseline memory use.
Introduce a stage-only parse-result entrypoint and cover end-to-end accept/apply idempotency so reviewed runs can be applied without mutating the graph during staging.
Support OpenCRE AI exchange CSV normalization and parsing into master-spreadsheet imports, wire new MITRE ATLAS/OWASP AI Exchange standards families, and expose a dedicated CLI flag with parser tests.
Ensure expected docker volumes exist before removal so local reset commands do not fail on fresh environments.
Register new resource families in the master spreadsheet mapping, hydrate CRE IDs from the DB with validation, and route per-standard parsers via dispatch. Extend myopencre handling and add unit tests for master spreadsheet and MyOpenCRE parsing.

Made-with: Cursor
Add options and orchestration so full or staged imports match the updated
pipeline and spreadsheet sources.

Made-with: Cursor
Cover document_is_ga_eligible and resource_name_ga_eligible_in_db for import
pipeline and post-apply GA gating.

Made-with: Cursor
Relative paths like standards_cache.sqlite were passed through
CMDConfig as sqlite:///standards_cache.sqlite. Flask-SQLAlchemy 3
anchors non-absolute sqlite URLs to app.instance_path, so CLI
commands such as imports and --generate_embeddings could open the
wrong database under instance/ instead of the working tree file.

Resolve filesystem paths with os.path.abspath before building the
SQLAlchemy URI so --cache_file follows the shell cwd.
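The fix reduces to anchoring the path before the URI is built. A minimal sketch (helper name is illustrative; the real change lives in CMDConfig):

```python
import os

def sqlite_uri(cache_file: str) -> str:
    """Resolve the path against the shell cwd before building the URI,
    since Flask-SQLAlchemy 3 anchors non-absolute sqlite URLs to
    app.instance_path rather than the working directory."""
    return "sqlite:///" + os.path.abspath(cache_file)
```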
The embedding pipeline receives a flat list of primary keys that
includes both CRE rows and node rows. Callers need a cheap existence
check on the node table before get_nodes(db_id=), which logs warnings
when no row exists.
Branch node versus CRE ids using has_node_with_db_id() before
get_nodes() so CRE keys are not probed as Standard nodes.

When hyperlinks point at PDFs or Playwright reports a download,
fetch bytes with requests and extract text with pypdf.

If remote text is missing or cleans to empty, embed from stored
node todict() fields instead of skipping the section.

Add pypdf to requirements and tests for PDF helpers and URL fallback.
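The fallback chain above — fetch bytes, extract PDF text, fall back to stored node fields when extraction comes back empty — can be sketched with injected fetch/extract callables. In the real code the extractor would be something like `pypdf.PdfReader(...).pages[n].extract_text()`; everything below is an illustrative shape, not the project's API.

```python
def page_text(url, fetch, extract_pdf, node_fields):
    """Return embeddable text for a hyperlink: PDF extraction for .pdf
    targets, decoded bytes otherwise, and stored node fields as the
    last-resort fallback instead of skipping the section."""
    raw = fetch(url)
    if url.lower().endswith(".pdf"):
        text = extract_pdf(raw)
    else:
        text = raw.decode("utf-8", "ignore")
    text = text.strip()
    if not text:
        # remote text missing or cleaned to empty: embed stored metadata
        text = "\n".join(str(v) for v in node_fields.values() if v)
    return text
```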
Some exports still store display names as "Title (NNN-NNN)" while the
sheet uses the clean title. Treat names as equivalent when the
suffix id matches the CRE external id instead of raising a conflict.
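The equivalence check above amounts to stripping a trailing `(NNN-NNN)` suffix and comparing it against the CRE external id. A hedged sketch (regex and function name are assumptions about the shape of the data, not the real implementation):

```python
import re

_SUFFIX = re.compile(r"^(?P<title>.+?)\s*\((?P<cre_id>\d{1,3}-\d{1,3})\)$")

def names_equivalent(stored, clean_title, external_id):
    """Treat 'Title (NNN-NNN)' as equal to the clean title when the
    parenthesized suffix matches the CRE external id."""
    if stored == clean_title:
        return True
    m = _SUFFIX.match(stored)
    return bool(m) and m["title"] == clean_title and m["cre_id"] == external_id
```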
Append alert ids as AlertID:<id> so numeric identifiers are not
ambiguous with other tag namespaces.
CRE_EXPORT_ONLY exits after copying Postgres into SQLite so operators
can refresh a local sqlite file without rerunning importers.

CRE_EXPORT_EMBEDDINGS_ONLY limits the copy to the embeddings table
(full import path and export-only path).

CRE_EXPORT_LOCAL_POSTGRES_ONLY (or --local-postgres-only) rejects
non-loopback Postgres URLs to reduce the risk of targeting remote
clusters by mistake.

CLI flags mirror the env toggles: --export-only, --embeddings-only,
--local-postgres-only. Optional CRE_EXPORT_SQLITE_PATH sets the
sqlite output when using export-only.
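The non-loopback guard behind CRE_EXPORT_LOCAL_POSTGRES_ONLY can be sketched as a hostname allowlist on the Postgres URL (helper name and exact host set are illustrative):

```python
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}

def assert_loopback(pg_url: str) -> None:
    """Refuse non-loopback Postgres URLs so destructive export/sync
    commands cannot accidentally target a remote cluster."""
    host = urlparse(pg_url).hostname or ""
    if host not in LOOPBACK_HOSTS:
        raise SystemExit(f"refusing non-loopback Postgres host: {host!r}")
```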
Operators often keep vectors in local sqlite and need to stage them on
local Postgres before pushing to a remote cluster. import-all.sh only
copies Postgres into sqlite, so add a small tool that replaces the
embeddings table on a destination Postgres from either sqlite or
another Postgres instance, with optional guards for non-loopback URLs.

Document the entry point from import-all.sh header comments.
Teach setup-heroku-staging.sh and push-local-postgres-to-heroku-staging.sh
to support table-scoped cache sync during sqlite->local postgres->remote
postgres workflows.

New CLI selectors --embeddings and --gap_analysis (or both) map to
SYNC_TABLES values. When no selector is provided, behavior remains full
DB sync.

Scoped mode dumps/restores only public.embeddings and/or
public.gap_analysis_results and skips destructive full-schema reset.
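Scoped mode maps naturally onto `pg_dump -t` per selected table. A sketch that only assembles the commands (flag choices here are plausible defaults, not a transcript of the actual scripts):

```python
def sync_commands(src_url, dest_url, tables):
    """Build a table-scoped dump/restore pair: dump only the selected
    tables (--embeddings / --gap_analysis selectors map to this list)
    and restore via psql, skipping any full-schema reset."""
    dump = ["pg_dump", "--clean", "--if-exists", "--no-owner"]
    for t in tables:
        dump += ["-t", t]  # restrict the dump to this table
    dump.append(src_url)
    return dump, ["psql", dest_url]
```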
Switch embeddings to an id-based primary key and allow nullable FK columns so rows without a CRE or node can be stored without violating constraints.

Made-with: Cursor
Convert empty-string IDs and optional text fields to NULL during embeddings sync so Postgres foreign key checks behave consistently with SQLite exports.

Made-with: Cursor
Run preload against local Postgres by default, auto-seed from sqlite when the target is empty, and use safer process lifecycle handling so GA workers and API use a consistent backend.

Made-with: Cursor
Treat taxonomy risk-list standards as GA-eligible (including CAPEC) and expose a GA-only standards endpoint so the map-analysis dropdown only shows standards that can participate in GA jobs.

Made-with: Cursor
Deduplicate in-flight GA jobs to avoid queue floods, add Redis-unavailable fallback handling, and enforce cache-only 404 behavior on Heroku when map analysis is missing.

Made-with: Cursor
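The in-flight dedupe above can be sketched as a lock-guarded set of pending pairs; a pair is only enqueued if it is not already tracked, and is released when its job finishes. This is a minimal single-process illustration (the real guard must also survive across workers, e.g. via Redis); all names are hypothetical.

```python
import threading

_in_flight = set()          # (source, target) pairs currently queued/running
_lock = threading.Lock()

def schedule_gap_analysis(pair, enqueue):
    """Enqueue a GA job unless an identical pair is already in flight."""
    with _lock:
        if pair in _in_flight:
            return False    # dedupe: avoid flooding the queue
        _in_flight.add(pair)
    enqueue(pair)
    return True

def mark_done(pair):
    """Release the pair so a future request can schedule it again."""
    with _lock:
        _in_flight.discard(pair)
```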
Add a dedicated GA backfill CLI + scripts that compute missing pairs from DB truth, and retire the old preload loop to avoid duplicate scheduling and stalled progress. Also rename the Heroku sync script to be app-agnostic and update docs/Make targets to match.

Made-with: Cursor
Handle legacy or pre-existing primary key constraint names when migrating embeddings so bootstrap/sync paths don't fail on duplicate/mismatched PK assumptions.

Made-with: Cursor
Allow setup-heroku-staging.sh to run teardown actions via --delete so staging resources can be removed from a single entrypoint. Keep bootstrap requirements strict in normal mode while relaxing them for delete mode.

Made-with: Cursor
Prevent lint runs from traversing local virtualenv and workspace cache directories so formatting only touches project sources.
Avoid Flask CLI context teardown failures during test execution by running the suite through unittest discovery after route validation.
Reintroduce legacy spreadsheet parser entrypoints and missing import-diff helpers, then align import telemetry and test expectations with the refactored parser and stricter CRE validation paths.
Use the active interpreter and repository-relative cwd so the subprocess test works in CI runners and local environments.
Apply Black formatting across the modified Python files so Super-Linter PYTHON_BLACK passes consistently in CI.
Drop checkpoint-only scripts, helpers, and tests tied to a retired import validation flow so CI no longer carries dead maintenance surface.
Run a separate frontend lint job in GitHub Actions using prettier --check on frontend source files so style regressions are caught in CI.
Apply consistent prettier formatting in modified frontend components and routes to reduce churn and keep the branch style-clean.
Handle transient npm registry 502 failures in CI by retrying frontend dependency installation with timeout and backoff.
@northdpole northdpole merged commit 175e883 into main Apr 20, 2026
7 checks passed
@northdpole northdpole deleted the cre-import-maturity-spyros-only branch April 20, 2026 18:23