import/ga: harden staged import pipeline, GA backfill, and Heroku ops flows #886
Merged
northdpole merged 63 commits into main on Apr 20, 2026
Conversation
Follow-up on the ISO spreadsheet mock so parity tests match parser output.
- Add importer-local Family/Subtype/Audience/Maturity enums and a tag builder/validator in base_parser_defs
- Update all external parsers to emit family/subtype/source/audience/maturity tags, including training apps like Juice Shop
- Enforce tag presence via validate_classification_tags and extend parser tests to assert literal tag strings as the external convention
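A minimal sketch of what a tag builder and presence validator along these lines could look like. The enum members, tag namespaces, and function signatures here are assumptions for illustration; the real definitions live in base_parser_defs and may differ.

```python
from enum import Enum


class Family(Enum):
    # Hypothetical members; the importer-local enum may define others.
    STANDARD = "Standard"
    TOOL = "Tool"
    TRAINING = "Training"


REQUIRED_PREFIXES = ("family:", "subtype:", "source:", "audience:", "maturity:")


def build_classification_tags(family, subtype, source, audience, maturity):
    """Emit literal tag strings in the external 'namespace:value' convention."""
    return [
        f"family:{family.value}",
        f"subtype:{subtype}",
        f"source:{source}",
        f"audience:{audience}",
        f"maturity:{maturity}",
    ]


def validate_classification_tags(tags):
    """Raise if any required tag namespace is absent from the tag list."""
    missing = [p for p in REQUIRED_PREFIXES if not any(t.startswith(p) for t in tags)]
    if missing:
        raise ValueError(f"missing classification tags: {missing}")
    return True
```

Asserting the literal strings (e.g. `"family:Training"`) in parser tests, rather than enum members, keeps the external convention the source of truth.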
…fixes
- Add normalize_embeddings_content for stable cache comparison
- Skip embedding generation when content unchanged (compare embeddings_content)
- Check nodes before CRE to avoid spurious 'CRE does not exist' logs
- Fix Playwright timeout: use TimeoutError from sync_api, not internal _api_types
- Skip malformed Neo4j nodes in gap analysis formatting instead of crashing
- Add commit retry with backoff for sqlite 'database is locked'
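A sketch of the retry-with-backoff idea for sqlite's transient lock errors. The helper name and parameters are illustrative, assuming a SQLAlchemy-style session; the project's actual retry wrapper may differ.

```python
import time


def commit_with_retry(session, retries=5, base_delay=0.1):
    """Retry session.commit() when sqlite reports 'database is locked'.

    Any other error, or exhausting the retry budget, re-raises immediately.
    """
    for attempt in range(retries):
        try:
            session.commit()
            return
        except Exception as exc:  # in practice: sqlalchemy.exc.OperationalError
            if "database is locked" not in str(exc) or attempt == retries - 1:
                raise
            session.rollback()
            time.sleep(base_delay * (2**attempt))  # exponential backoff
```

Rolling back before retrying matters: a failed commit leaves the session in a state where the next commit would otherwise fail for an unrelated reason.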
- Add _ga_eligible check; GA only for non-Tool/non-Code
- Add tests for GA skip (Tool) and GA run (taxonomy Standard)
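The eligibility gate described above can be sketched as a simple doctype check. The set membership and the `doctype` field name are assumptions about the document shape, not the project's exact implementation.

```python
# Assumption: these are the doctype strings the importer uses for
# Tool and Code documents; the real _ga_eligible check may consult
# richer document metadata.
GA_INELIGIBLE_TYPES = {"Tool", "Code"}


def ga_eligible(document: dict) -> bool:
    """Gap analysis runs only for non-Tool/non-Code documents."""
    return document.get("doctype") not in GA_INELIGIBLE_TYPES
```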
- Add verify_checkpoint_2_incremental: delete every Nth Standard embedding + GA row, refill, verify
- Fail if deleted keys not restored, count mismatch, untouched mutated
- Exclude URL-backed embeddings from mutation check (upstream content can change)
- Assert refilled GA matches upstream exactly
- Add content-change and structure-change recalculation probes
- Rebuild Neo4j from sqlite before GA refill for deterministic verification
- Add checkpoint2_heroku_incremental_verify.sh for Heroku export + local verify
Build a name->ID map before object creation so CRE IDs resolve consistently across rows, and skip unresolved deferred CRE links with warnings instead of hard-failing parse.
Move standard-family parsing to dedicated subparsers and keep behavior aligned via an equivalence test against the previous parsing path.
Track import runs in the database, introduce reusable standards diff logic, and wire import-run creation into spreadsheet import flow with test coverage.
Avoid double-parsing mutated spreadsheet rows during checkpoint imports so spreadsheet-linked standards (ASVS/ISO/etc.) stay stable in golden diffs.
Make incremental embeddings sensitive to document metadata (including URL-fetched node content) and add tests to ensure metadata changes trigger re-embedding while unchanged metadata is a cache hit. Adds an import-path integration test to confirm re-registering identical standards does not call the embedding provider twice.
Add stable standard snapshot hashing to detect manual main-graph edits and flag conflicting change-set ops. Include forward-compatible changeset JSON parsing and checkpoint3 verifier + tests.
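One way to build a stable snapshot hash like the one described: canonicalize the node's fields before hashing so the result is independent of dict ordering, and any manual main-graph edit surfaces as a mismatch against the staged snapshot. The field set and function name are illustrative.

```python
import hashlib
import json


def standard_snapshot_hash(standard: dict) -> str:
    """Stable content hash for a Standard node (sketch).

    sort_keys plus fixed separators gives a canonical JSON encoding,
    so two dicts with the same content always hash identically.
    """
    canonical = json.dumps(standard, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Comparing the stored hash against a freshly computed one is then enough to flag a conflicting change-set op before apply.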
Extend `git.clone()` to support shallow clones with optional sparse checkout, and apply it to high-volume upstream repos (Cheat Sheets, Secure Headers, ZAP alerts, misc tools) to reduce IO and clone time.
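A sketch of how the shallow/sparse options could map onto stock git flags. This is not the project's `git.clone()` wrapper; the flag combination (`--depth`, `--filter=blob:none`, `--sparse`, then `git sparse-checkout set`) follows upstream git's CLI.

```python
import subprocess


def build_clone_command(url, dest, depth=None, sparse=False):
    """Assemble a git clone invocation with optional shallow/sparse flags."""
    cmd = ["git", "clone"]
    if depth:
        cmd += ["--depth", str(depth)]
    if sparse:
        # blob:none defers blob download; --sparse starts with a minimal checkout
        cmd += ["--filter=blob:none", "--sparse"]
    return cmd + [url, dest]


def clone(url, dest, depth=None, sparse_paths=None):
    """Run the clone, then narrow the checkout to the requested paths."""
    subprocess.run(build_clone_command(url, dest, depth, bool(sparse_paths)), check=True)
    if sparse_paths:
        subprocess.run(["git", "-C", dest, "sparse-checkout", "set", *sparse_paths], check=True)
```

For repos like Cheat Sheets where only a subdirectory is parsed, this avoids downloading history and unrelated blobs entirely.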
Ensure master spreadsheet CREs and per-family standards are enriched with required classification tags so the central import pipeline can validate and apply them. Update spreadsheet and import-path tests to compare by stable ids and to mock RQ waiting.
Implement Phase 2 Step 3 admin endpoints to list import runs, fetch run details, and retrieve structured change sets. Gate endpoints behind login and an ADMIN_IMPORTS_ENABLED flag. Includes tests and a deterministic Phase2 checkpoint5 verifier.
- StagedChangeSet.apply_error and update_staged_change_set helper
- apply_changeset(): dry-run, conflict guard, add/modify/remove Standard nodes
- Idempotent when staging_status is already applied
- scripts/checkpoint_phase3_apply_verify.py: dry-run, apply, idempotency, conflict
- unittest wrapper for CI
Treat worker shutdown as expected cleanup so import-all no longer reports false failures after successful imports.
Add staging lifecycle scripts that mirror production config/dyno shape, import sqlite through local postgres, configure domain+ACM, and safely teardown staging db/app resources.
Defer prompt_client import to chat/import handlers so web dynos avoid loading heavy ML dependencies at boot and reduce baseline memory use.
Introduce a stage-only parse-result entrypoint and cover end-to-end accept/apply idempotency so reviewed runs can be applied without mutating the graph during staging.
Support OpenCRE AI exchange CSV normalization and parsing into master-spreadsheet imports, wire new MITRE ATLAS/OWASP AI Exchange standards families, and expose a dedicated CLI flag with parser tests.
Ensure expected docker volumes exist before removal so local reset commands do not fail on fresh environments.
Register new resource families in the master spreadsheet mapping, hydrate CRE IDs from the DB with validation, and route per-standard parsers via dispatch. Extend myopencre handling and add unit tests for master spreadsheet and MyOpenCRE parsing. Made-with: Cursor
Add options and orchestration so full or staged imports match the updated pipeline and spreadsheet sources. Made-with: Cursor
Cover document_is_ga_eligible and resource_name_ga_eligible_in_db for import pipeline and post-apply GA gating. Made-with: Cursor
Relative paths like standards_cache.sqlite were passed through CMDConfig as sqlite:///standards_cache.sqlite. Flask-SQLAlchemy 3 anchors non-absolute sqlite URLs to app.instance_path, so CLI commands such as imports and --generate_embeddings could open the wrong database under instance/ instead of the working tree file. Resolve filesystem paths with os.path.abspath before building the SQLAlchemy URI so --cache_file follows the shell cwd.
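The fix described above reduces to absolutizing the path before building the URI, so the resulting sqlite URL is anchored to the shell cwd rather than `app.instance_path`. A minimal sketch (the helper name is illustrative):

```python
import os


def sqlite_uri_from_cache_file(cache_file: str) -> str:
    """Build a SQLAlchemy sqlite URI anchored to the shell cwd.

    Flask-SQLAlchemy 3 resolves non-absolute sqlite URLs relative to the
    app instance directory, so absolutize before constructing the URI.
    """
    return "sqlite:///" + os.path.abspath(cache_file)
```

With an absolute POSIX path this naturally yields the four-slash form (`sqlite:////home/...`) that SQLAlchemy expects for absolute sqlite paths.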
The embedding pipeline receives a flat list of primary keys that includes both CRE rows and node rows. Callers need a cheap existence check on the node table before get_nodes(db_id=), which logs warnings when no row exists.
Branch node versus CRE ids using has_node_with_db_id() before get_nodes() so CRE keys are not probed as Standard nodes. When hyperlinks point at PDFs or Playwright reports a download, fetch bytes with requests and extract text with pypdf. If remote text is missing or cleans to empty, embed from stored node todict() fields instead of skipping the section. Add pypdf to requirements and tests for PDF helpers and URL fallback.
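A sketch of the key-branching and fallback logic described above. The helper names (`has_node_with_db_id`, `get_nodes`, `todict`) follow the description, but their signatures here are assumptions, as is the remote-text callback that stands in for the requests/pypdf fetch path.

```python
def embedding_source_text(db, db_id, fetch_remote_text):
    """Pick embedding content for a primary key that may be a CRE or a node.

    Returns None for CRE keys (handled elsewhere) so node lookups are
    never probed with CRE ids, avoiding spurious warnings.
    """
    if not db.has_node_with_db_id(db_id):
        return None  # CRE key: skip the Standard-node path entirely
    node = db.get_nodes(db_id=db_id)[0]
    text = fetch_remote_text(node)  # may pull PDFs via requests + pypdf
    if text and text.strip():
        return text
    # Remote text missing or cleans to empty: embed from stored fields
    # instead of skipping the section.
    return " ".join(str(v) for v in node.todict().values())
```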
Some exports still store display names as "Title (NNN-NNN)" while the sheet uses the clean title. Treat names as equivalent when the suffix id matches the CRE external id instead of raising a conflict.
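The equivalence rule could be expressed with a suffix-id pattern like the following. The `NNN-NNN` id format and function signature are assumptions drawn from the description, not the project's actual comparator.

```python
import re

# Assumed display-name shape: "Title (NNN-NNN)" where NNN-NNN is a CRE id.
SUFFIX_ID = re.compile(r"(?P<title>.+?)\s*\((?P<id>\d{3}-\d{3})\)")


def names_equivalent(export_name: str, sheet_name: str, cre_external_id: str) -> bool:
    """Treat 'Title (NNN-NNN)' as equal to the clean title when the
    parenthesized id matches the CRE external id."""
    if export_name == sheet_name:
        return True
    m = SUFFIX_ID.fullmatch(export_name)
    return bool(m) and m.group("title") == sheet_name and m.group("id") == cre_external_id
```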
Append alert ids as AlertID:<id> so numeric identifiers are not ambiguous with other tag namespaces.
- CRE_EXPORT_ONLY exits after copying Postgres into SQLite so operators can refresh a local sqlite file without rerunning importers.
- CRE_EXPORT_EMBEDDINGS_ONLY limits the copy to the embeddings table (full import path and export-only path).
- CRE_EXPORT_LOCAL_POSTGRES_ONLY (or --local-postgres-only) rejects non-loopback Postgres URLs to reduce the risk of targeting remote clusters by mistake.
- CLI flags mirror the env toggles: --export-only, --embeddings-only, --local-postgres-only.
- Optional CRE_EXPORT_SQLITE_PATH sets the sqlite output when using export-only.
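The loopback guard can be sketched as a hostname check on the parsed URL. The accepted host set and error wording are assumptions; the point is to fail fast before any connection is made.

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def assert_local_postgres(url: str) -> None:
    """Reject non-loopback Postgres URLs (sketch of --local-postgres-only)."""
    host = urlsplit(url).hostname
    if host not in LOOPBACK_HOSTS:
        raise ValueError(f"refusing non-loopback Postgres host: {host!r}")
```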
Operators often keep vectors in local sqlite and need to stage them on local Postgres before pushing to a remote cluster. import-all.sh only copies Postgres into sqlite, so add a small tool that replaces the embeddings table on a destination Postgres from either sqlite or another Postgres instance, with optional guards for non-loopback URLs. Document the entry point from import-all.sh header comments.
Teach setup-heroku-staging.sh and push-local-postgres-to-heroku-staging.sh to support table-scoped cache sync during sqlite->local postgres->remote postgres workflows. New CLI selectors --embeddings and --gap_analysis (or both) map to SYNC_TABLES values. When no selector is provided, behavior remains full DB sync. Scoped mode dumps/restores only public.embeddings and/or public.gap_analysis_results and skips destructive full-schema reset.
Switch embeddings to an id-based primary key and allow nullable FK columns so rows without a CRE or node can be stored without violating constraints. Made-with: Cursor
Convert empty-string IDs and optional text fields to NULL during embeddings sync so Postgres foreign key checks behave consistently with SQLite exports. Made-with: Cursor
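A minimal sketch of the empty-string-to-NULL normalization: only fields declared optional are nullified, so a legitimately empty required field would still surface as a constraint error. Field names here are illustrative.

```python
def nullify_empty(row: dict, optional_fields: set) -> dict:
    """Map empty-string values in optional columns to None so Postgres
    FK checks behave consistently with SQLite exports (sketch)."""
    return {
        k: (None if k in optional_fields and v == "" else v)
        for k, v in row.items()
    }
```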
Run preload against local Postgres by default, auto-seed from sqlite when the target is empty, and use safer process lifecycle handling so GA workers and API use a consistent backend. Made-with: Cursor
Treat taxonomy risk-list standards as GA-eligible (including CAPEC) and expose a GA-only standards endpoint so the map-analysis dropdown only shows standards that can participate in GA jobs. Made-with: Cursor
Deduplicate in-flight GA jobs to avoid queue floods, add Redis-unavailable fallback handling, and enforce cache-only 404 behavior on Heroku when map analysis is missing. Made-with: Cursor
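One common shape for the in-flight dedupe is a Redis `SET NX EX` lock keyed on the job pair. This is a sketch assuming a redis-py-style connection; the key naming, TTL, and queue interface are assumptions, not the project's actual implementation.

```python
def enqueue_ga_job(conn, queue, pair, ttl=3600):
    """Enqueue a GA pair unless an identical job is already in flight.

    conn.set(..., nx=True, ex=ttl) atomically claims the dedupe key;
    it returns a falsy value when the key already exists, in which case
    we skip enqueueing to avoid flooding the queue.
    """
    key = f"ga:inflight:{pair[0]}:{pair[1]}"
    if not conn.set(key, "1", nx=True, ex=ttl):
        return False  # already queued or running
    queue.enqueue(pair)
    return True
```

The TTL doubles as a safety valve: if a worker dies mid-job, the key expires and the pair becomes schedulable again.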
Add a dedicated GA backfill CLI + scripts that compute missing pairs from DB truth, and retire the old preload loop to avoid duplicate scheduling and stalled progress. Also rename the Heroku sync script to be app-agnostic and update docs/Make targets to match. Made-with: Cursor
Handle legacy or pre-existing primary key constraint names when migrating embeddings so bootstrap/sync paths don't fail on duplicate/mismatched PK assumptions. Made-with: Cursor
Allow setup-heroku-staging.sh to run teardown actions via --delete so staging resources can be removed from a single entrypoint. Keep bootstrap requirements strict in normal mode while relaxing them for delete mode. Made-with: Cursor
Prevent lint runs from traversing local virtualenv and workspace cache directories so formatting only touches project sources.
Avoid Flask CLI context teardown failures during test execution by running the suite through unittest discovery after route validation.
Reintroduce legacy spreadsheet parser entrypoints and missing import-diff helpers, then align import telemetry and test expectations with the refactored parser and stricter CRE validation paths.
Use the active interpreter and repository-relative cwd so the subprocess test works in CI runners and local environments.
Apply Black formatting across the modified Python files so Super-Linter PYTHON_BLACK passes consistently in CI.
Drop checkpoint-only scripts, helpers, and tests tied to a retired import validation flow so CI no longer carries dead maintenance surface.
Run a separate frontend lint job in GitHub Actions using prettier --check on frontend source files so style regressions are caught in CI.
Apply consistent prettier formatting in modified frontend components and routes to reduce churn and keep the branch style-clean.
Handle transient npm registry 502 failures in CI by retrying frontend dependency installation with timeout and backoff.
Summary

- …`map_analysis` handling, queue dedupe, and a dedicated missing-pair backfill flow replacing fragile preload polling.

Why

Scope Highlights

Import Pipeline & Data Model

- …(`import_diff`, `import_apply`, `import_pipeline`, `import_post_apply`, telemetry/impact helpers) for clearer stage/apply boundaries.

GA Eligibility, Scheduling, and API Behavior

- …`/rest/v1/map_analysis` behavior: …

Scripts & Operational Maturity

- …`Makefile` + `README` instructions to match the new operational flow.

DB/Migrations

Tests

Risks / Notes for Reviewers

Test Plan

- …(`make test`) and verify import/GA suites pass.
- `make import-all`
- `make backfill-gap-analysis`
- …`map_analysis` behavior for known cache-hit and cache-miss cases.
- …`--delete`.

Rollout / Ops