
import/ga: harden staged import pipeline, GA backfill, and Heroku ops flows#886

Merged
northdpole merged 63 commits into main from cre-import-maturity-spyros-only
Apr 20, 2026
Conversation

@northdpole
Collaborator

Summary

  • Refactors the import workflow into a stronger staged/apply model with snapshotting, change-set persistence, conflict handling, telemetry, and expanded test coverage.
  • Hardens gap-analysis behavior end to end: broader GA eligibility (including taxonomy/risk-list standards), safer map_analysis handling, queue dedupe, and a dedicated missing-pair backfill flow replacing fragile preload polling.
  • Improves operational tooling for local/staging/prod parity: Postgres-first import/backfill scripts, safer Heroku setup/teardown, and generalized DB sync script usage for staging or production.

Why

  • Existing import and GA flows could stall or become noisy under long runs, especially around queueing, fallback behavior, and environment differences.
  • Operational tasks (bootstrap, sync, teardown, cache population) needed deterministic, repeatable scripts with fewer hidden assumptions.
  • We needed a reliable path to compute and ship missing GA cache entries (notably for AI Exchange-related standards) without HTTP-loop fragility.

Scope Highlights

Import Pipeline & Data Model

  • Adds/expands import-run + staged change-set workflow and apply path.
  • Persists richer import metadata and diff artifacts.
  • Refactors pipeline internals (import_diff, import_apply, import_pipeline, import_post_apply, telemetry/impact helpers) for clearer stage/apply boundaries.
  • Improves parser normalization and CRE link derivation consistency.

GA Eligibility, Scheduling, and API Behavior

  • Broadens GA eligibility to include taxonomy/risk-list standards (e.g. CAPEC-class resources).
  • Updates GA dropdown/source filtering to only show GA-eligible standards.
  • Hardens /rest/v1/map_analysis behavior:
    • cache-first safety,
    • in-flight dedupe protections,
    • safer queue/fallback flow,
    • Heroku cache-miss safety path to avoid production 500s when backend dependencies are absent.
  • Replaces old preload-oriented GA strategy with CLI/scripted missing-pair backfill flow.
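The hardened `/rest/v1/map_analysis` flow above can be sketched as a cache-first decision chain. This is a minimal illustration, not the real endpoint's signature: the parameter names, the `(source, target)` cache key, and the status payloads are assumptions for the sketch.

```python
def map_analysis(source, target, cache, schedule, on_heroku=False):
    """Cache-first sketch: serve cached results; on Heroku cache misses
    return 404 instead of computing (avoiding 500s when backend deps are
    absent); otherwise schedule the GA job and report it as pending."""
    cached = cache.get((source, target))
    if cached is not None:
        return 200, cached  # cache hit: no queueing, no backend work
    if on_heroku:
        # cache-only safety path: never try to compute in-process on Heroku
        return 404, {"error": "gap analysis not precomputed for this pair"}
    schedule((source, target))  # fall back to queueing the job
    return 202, {"status": "pending"}
```

The key ordering is that the cache lookup always happens first, so the queue and fallback paths are only reachable on a genuine miss.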

Scripts & Operational Maturity

  • Introduces/updates scripts for:
    • GA missing-pair backfill,
    • import-all Postgres workflows,
    • embeddings/table sync and scoped sync paths,
    • generalized local->Heroku Postgres push (app-agnostic naming),
    • staging lifecycle management including teardown mode from setup entrypoint.
  • Updates Makefile + README instructions to match the new operational flow.

DB/Migrations

  • Adds surrogate-key/PK hardening work for embeddings.
  • Makes embeddings migration path more idempotent/robust on Postgres variants.
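The idempotence pattern behind the migration hardening can be illustrated with a check-before-rebuild guard. This sketch uses SQLite so it is self-contained; the real migration targets Postgres and additionally handles legacy/mismatched constraint names, and the table shape here is hypothetical.

```python
import sqlite3

def ensure_surrogate_pk(conn: sqlite3.Connection, table: str) -> bool:
    """Only rebuild the table with a surrogate id PK if no primary key
    exists yet, so re-running the migration is a no-op (illustrative)."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    if any(row[5] for row in cols):  # row[5] is sqlite's pk flag
        return False  # already migrated: idempotent exit
    conn.execute(f"ALTER TABLE {table} RENAME TO {table}_old")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, doc TEXT)")
    conn.execute(f"INSERT INTO {table} (doc) SELECT doc FROM {table}_old")
    conn.execute(f"DROP TABLE {table}_old")
    return True
```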

Tests

  • Adds and updates tests across:
    • import staging/apply/diff/dashboard pathways,
    • GA gating and pair scheduling behavior,
    • admin import APIs,
    • parser and web behavior touched by the refactor.

Risks / Notes for Reviewers

  • Large cross-cutting PR touching import engine, GA orchestration, scripts, tests, and migration behavior.
  • Some features were rebased/cherry-picked from long-lived branch history; commit subjects may differ from earlier iterations while preserving intended content.
  • Recommended review order:
    1. DB/migration changes
    2. import pipeline/apply modules
    3. GA endpoint + eligibility changes
    4. scripts + docs
    5. tests

Test Plan

  • Run backend tests (make test) and verify import/GA suites pass.
  • Run lint/static checks used by CI.
  • Validate local import + GA backfill happy path:
    • make import-all
    • make backfill-gap-analysis
  • Validate map_analysis behavior for known cache-hit and cache-miss cases.
  • Validate staging setup/teardown script behavior including --delete.
  • Smoke-check AI Exchange + CAPEC GA cache entries on target env after sync.

Rollout / Ops

  • Deploy to staging first, run import/backfill flow, verify key GA pairs and API behavior.
  • Promote to production after parity checks pass.
  • Retire old operational commands/scripts per the README/Makefile updates.

Follow-up on the ISO spreadsheet mock so parity tests match parser output.
Add importer-local Family/Subtype/Audience/Maturity enums and a tag builder/validator in base_parser_defs.
Update all external parsers to emit family/subtype/source/audience/maturity tags, including training apps like Juice Shop.
Enforce tag presence via validate_classification_tags and extend parser tests to assert literal tag strings as the external convention.
…fixes

- Add normalize_embeddings_content for stable cache comparison
- Skip embedding generation when content unchanged (compare embeddings_content)
- Check nodes before CRE to avoid spurious 'CRE does not exist' logs
- Fix Playwright timeout: use TimeoutError from sync_api, not internal _api_types
- Skip malformed Neo4j nodes in gap analysis formatting instead of crashing
- Add commit retry with backoff for sqlite 'database is locked'
- Add _ga_eligible check; GA only for non-Tool/non-Code
- Add tests for GA skip (Tool) and GA run (taxonomy Standard)
- Add verify_checkpoint_2_incremental: delete every Nth Standard embedding + GA row, refill, verify
- Fail if deleted keys are not restored, counts mismatch, or untouched rows are mutated
- Exclude URL-backed embeddings from mutation check (upstream content can change)
- Assert refilled GA matches upstream exactly
- Add content-change and structure-change recalculation probes
- Rebuild Neo4j from sqlite before GA refill for deterministic verification
- Add checkpoint2_heroku_incremental_verify.sh for Heroku export + local verify
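The commit-retry item above follows a standard exponential-backoff shape. A minimal sketch, assuming an injectable commit callable and sleep function (the real code retries the SQLAlchemy session commit; these names are illustrative):

```python
import sqlite3
import time

def commit_with_retry(commit, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a commit on sqlite 'database is locked' with exponential
    backoff; re-raise any other error or the final locked failure."""
    for attempt in range(attempts):
        try:
            return commit()
        except sqlite3.OperationalError as exc:
            if "database is locked" not in str(exc) or attempt == attempts - 1:
                raise  # unrelated error, or retries exhausted
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```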
Build a name->ID map before object creation so CRE IDs resolve consistently across rows, and skip unresolved deferred CRE links with warnings instead of hard-failing parse.
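The two-pass resolution described above can be sketched as: build the name→ID map over all rows first, then resolve links against it, warning on (rather than failing on) unresolved deferrals. Row and field names here are hypothetical.

```python
import logging

def resolve_cre_links(rows):
    """Pass 1: map every named CRE to its ID. Pass 2: resolve links,
    skipping unresolved deferred links with a warning (illustrative)."""
    name_to_id = {r["name"]: r["id"] for r in rows if r.get("id")}
    resolved = []
    for r in rows:
        for link in r.get("links", []):
            target = name_to_id.get(link)
            if target is None:
                logging.warning("skipping unresolved CRE link %r -> %r",
                                r["name"], link)
                continue  # soft-fail instead of aborting the parse
            resolved.append((r["name"], target))
    return resolved
```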
Move standard-family parsing to dedicated subparsers and keep behavior aligned via an equivalence test against the previous parsing path.
Track import runs in the database, introduce reusable standards diff logic, and wire import-run creation into spreadsheet import flow with test coverage.
Avoid double-parsing mutated spreadsheet rows during checkpoint imports so spreadsheet-linked standards (ASVS/ISO/etc.) stay stable in golden diffs.
Make incremental embeddings sensitive to document metadata (including URL-fetched node content) and add tests to ensure metadata changes trigger re-embedding while unchanged metadata is a cache hit. Adds an import-path integration test to confirm re-registering identical standards does not call the embedding provider twice.
Add stable standard snapshot hashing to detect manual main-graph edits and flag conflicting change-set ops. Include forward-compatible changeset JSON parsing and checkpoint3 verifier + tests.
Extend `git.clone()` to support shallow clones with optional sparse checkout, and apply it to high-volume upstream repos (Cheat Sheets, Secure Headers, ZAP alerts, misc tools) to reduce IO and clone time.
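The shallow/sparse combination above corresponds to `git clone --depth 1` plus `--filter=blob:none --sparse` followed by `git sparse-checkout set`. A sketch that only builds the command lists (the project's actual `git.clone()` wrapper differs; function name and signature here are assumptions):

```python
def clone_commands(url, dest, sparse_paths=None):
    """Return the git invocations for a shallow (optionally sparse) clone,
    cutting IO and transfer for high-volume upstream repos."""
    cmds = []
    base = ["git", "clone", "--depth", "1"]
    if sparse_paths:
        # blobless filter + sparse init: fetch only what the checkout needs
        base += ["--filter=blob:none", "--sparse"]
    cmds.append(base + [url, dest])
    if sparse_paths:
        cmds.append(["git", "-C", dest, "sparse-checkout", "set", *sparse_paths])
    return cmds
```

These would typically be run via `subprocess.run(cmd, check=True)`.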
Ensure master spreadsheet CREs and per-family standards are enriched with required classification tags so the central import pipeline can validate and apply them.

Update spreadsheet and import-path tests to compare by stable ids and to mock RQ waiting.
Implement Phase 2 Step 3 admin endpoints to list import runs, fetch run details, and retrieve structured change sets. Gate endpoints behind login and an ADMIN_IMPORTS_ENABLED flag.
Includes tests and a deterministic Phase2 checkpoint5 verifier.
- StagedChangeSet.apply_error and update_staged_change_set helper
- apply_changeset(): dry-run, conflict guard, add/modify/remove Standard nodes
- Idempotent when staging_status is already applied
- scripts/checkpoint_phase3_apply_verify.py: dry-run, apply, idempotency, conflict
- unittest wrapper for CI
Treat worker shutdown as expected cleanup so import-all no longer reports false failures after successful imports.
Add staging lifecycle scripts that mirror production config/dyno shape, import sqlite through local postgres, configure domain+ACM, and safely teardown staging db/app resources.
Defer prompt_client import to chat/import handlers so web dynos avoid loading heavy ML dependencies at boot and reduce baseline memory use.
Introduce a stage-only parse-result entrypoint and cover end-to-end accept/apply idempotency so reviewed runs can be applied without mutating the graph during staging.
Support OpenCRE AI exchange CSV normalization and parsing into master-spreadsheet imports, wire new MITRE ATLAS/OWASP AI Exchange standards families, and expose a dedicated CLI flag with parser tests.
Ensure expected docker volumes exist before removal so local reset commands do not fail on fresh environments.
Register new resource families in the master spreadsheet mapping, hydrate CRE IDs from the DB with validation, and route per-standard parsers via dispatch. Extend myopencre handling and add unit tests for master spreadsheet and MyOpenCRE parsing.

Made-with: Cursor
Add options and orchestration so full or staged imports match the updated
pipeline and spreadsheet sources.

Made-with: Cursor
Cover document_is_ga_eligible and resource_name_ga_eligible_in_db for import
pipeline and post-apply GA gating.

Made-with: Cursor
Relative paths like standards_cache.sqlite were passed through
CMDConfig as sqlite:///standards_cache.sqlite. Flask-SQLAlchemy 3
anchors non-absolute sqlite URLs to app.instance_path, so CLI
commands such as imports and --generate_embeddings could open the
wrong database under instance/ instead of the working tree file.

Resolve filesystem paths with os.path.abspath before building the
SQLAlchemy URI so --cache_file follows the shell cwd.
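The fix reduces to anchoring the path before the URI is built. A minimal sketch (helper name is illustrative; the real change lives in CMDConfig):

```python
import os

def sqlite_uri(cache_file: str) -> str:
    """Resolve the path against the shell cwd before building the URI,
    since Flask-SQLAlchemy 3 anchors non-absolute sqlite URLs to
    app.instance_path rather than the working directory."""
    return "sqlite:///" + os.path.abspath(cache_file)
```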
The embedding pipeline receives a flat list of primary keys that
includes both CRE rows and node rows. Callers need a cheap existence
check on the node table before get_nodes(db_id=), which logs warnings
when no row exists.
Branch node versus CRE ids using has_node_with_db_id() before
get_nodes() so CRE keys are not probed as Standard nodes.

When hyperlinks point at PDFs or Playwright reports a download,
fetch bytes with requests and extract text with pypdf.

If remote text is missing or cleans to empty, embed from stored
node todict() fields instead of skipping the section.

Add pypdf to requirements and tests for PDF helpers and URL fallback.
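The fallback chain above — fetch bytes, extract PDF text, fall back to stored node fields when extraction comes back empty — can be sketched with injected fetch/extract callables. In the real code the extractor would be something like `pypdf.PdfReader(...).pages[n].extract_text()`; everything below is an illustrative shape, not the project's API.

```python
def page_text(url, fetch, extract_pdf, node_fields):
    """Return embeddable text for a hyperlink: PDF extraction for .pdf
    targets, decoded bytes otherwise, and stored node fields as the
    last-resort fallback instead of skipping the section."""
    raw = fetch(url)
    if url.lower().endswith(".pdf"):
        text = extract_pdf(raw)
    else:
        text = raw.decode("utf-8", "ignore")
    text = text.strip()
    if not text:
        # remote text missing or cleaned to empty: embed stored metadata
        text = "\n".join(str(v) for v in node_fields.values() if v)
    return text
```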
Some exports still store display names as "Title (NNN-NNN)" while the
sheet uses the clean title. Treat names as equivalent when the
suffix id matches the CRE external id instead of raising a conflict.
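The equivalence check above amounts to stripping a trailing `(NNN-NNN)` suffix and comparing it against the CRE external id. A hedged sketch (regex and function name are assumptions about the shape of the data, not the real implementation):

```python
import re

_SUFFIX = re.compile(r"^(?P<title>.+?)\s*\((?P<cre_id>\d{1,3}-\d{1,3})\)$")

def names_equivalent(stored, clean_title, external_id):
    """Treat 'Title (NNN-NNN)' as equal to the clean title when the
    parenthesized suffix matches the CRE external id."""
    if stored == clean_title:
        return True
    m = _SUFFIX.match(stored)
    return bool(m) and m["title"] == clean_title and m["cre_id"] == external_id
```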
Append alert ids as AlertID:<id> so numeric identifiers are not
ambiguous with other tag namespaces.
CRE_EXPORT_ONLY exits after copying Postgres into SQLite so operators
can refresh a local sqlite file without rerunning importers.

CRE_EXPORT_EMBEDDINGS_ONLY limits the copy to the embeddings table
(full import path and export-only path).

CRE_EXPORT_LOCAL_POSTGRES_ONLY (or --local-postgres-only) rejects
non-loopback Postgres URLs to reduce the risk of targeting remote
clusters by mistake.

CLI flags mirror the env toggles: --export-only, --embeddings-only,
--local-postgres-only. Optional CRE_EXPORT_SQLITE_PATH sets the
sqlite output when using export-only.
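The non-loopback guard behind CRE_EXPORT_LOCAL_POSTGRES_ONLY can be sketched as a hostname allowlist on the Postgres URL (helper name and exact host set are illustrative):

```python
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}

def assert_loopback(pg_url: str) -> None:
    """Refuse non-loopback Postgres URLs so destructive export/sync
    commands cannot accidentally target a remote cluster."""
    host = urlparse(pg_url).hostname or ""
    if host not in LOOPBACK_HOSTS:
        raise SystemExit(f"refusing non-loopback Postgres host: {host!r}")
```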
Operators often keep vectors in local sqlite and need to stage them on
local Postgres before pushing to a remote cluster. import-all.sh only
copies Postgres into sqlite, so add a small tool that replaces the
embeddings table on a destination Postgres from either sqlite or
another Postgres instance, with optional guards for non-loopback URLs.

Document the entry point from import-all.sh header comments.
Teach setup-heroku-staging.sh and push-local-postgres-to-heroku-staging.sh
to support table-scoped cache sync during sqlite->local postgres->remote
postgres workflows.

New CLI selectors --embeddings and --gap_analysis (or both) map to
SYNC_TABLES values. When no selector is provided, behavior remains full
DB sync.

Scoped mode dumps/restores only public.embeddings and/or
public.gap_analysis_results and skips destructive full-schema reset.
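Scoped mode maps naturally onto `pg_dump -t` per selected table. A sketch that only assembles the commands (flag choices here are plausible defaults, not a transcript of the actual scripts):

```python
def sync_commands(src_url, dest_url, tables):
    """Build a table-scoped dump/restore pair: dump only the selected
    tables (--embeddings / --gap_analysis selectors map to this list)
    and restore via psql, skipping any full-schema reset."""
    dump = ["pg_dump", "--clean", "--if-exists", "--no-owner"]
    for t in tables:
        dump += ["-t", t]  # restrict the dump to this table
    dump.append(src_url)
    return dump, ["psql", dest_url]
```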
Switch embeddings to an id-based primary key and allow nullable FK columns so rows without a CRE or node can be stored without violating constraints.

Made-with: Cursor
Convert empty-string IDs and optional text fields to NULL during embeddings sync so Postgres foreign key checks behave consistently with SQLite exports.

Made-with: Cursor
Run preload against local Postgres by default, auto-seed from sqlite when the target is empty, and use safer process lifecycle handling so GA workers and API use a consistent backend.

Made-with: Cursor
Treat taxonomy risk-list standards as GA-eligible (including CAPEC) and expose a GA-only standards endpoint so the map-analysis dropdown only shows standards that can participate in GA jobs.

Made-with: Cursor
Deduplicate in-flight GA jobs to avoid queue floods, add Redis-unavailable fallback handling, and enforce cache-only 404 behavior on Heroku when map analysis is missing.

Made-with: Cursor
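The in-flight dedupe above can be sketched as a lock-guarded set of pending pairs; a pair is only enqueued if it is not already tracked, and is released when its job finishes. This is a minimal single-process illustration (the real guard must also survive across workers, e.g. via Redis); all names are hypothetical.

```python
import threading

_in_flight = set()          # (source, target) pairs currently queued/running
_lock = threading.Lock()

def schedule_gap_analysis(pair, enqueue):
    """Enqueue a GA job unless an identical pair is already in flight."""
    with _lock:
        if pair in _in_flight:
            return False    # dedupe: avoid flooding the queue
        _in_flight.add(pair)
    enqueue(pair)
    return True

def mark_done(pair):
    """Release the pair so a future request can schedule it again."""
    with _lock:
        _in_flight.discard(pair)
```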
Add a dedicated GA backfill CLI + scripts that compute missing pairs from DB truth, and retire the old preload loop to avoid duplicate scheduling and stalled progress. Also rename the Heroku sync script to be app-agnostic and update docs/Make targets to match.

Made-with: Cursor
Handle legacy or pre-existing primary key constraint names when migrating embeddings so bootstrap/sync paths don't fail on duplicate/mismatched PK assumptions.

Made-with: Cursor
Allow setup-heroku-staging.sh to run teardown actions via --delete so staging resources can be removed from a single entrypoint. Keep bootstrap requirements strict in normal mode while relaxing them for delete mode.

Made-with: Cursor
Prevent lint runs from traversing local virtualenv and workspace cache directories so formatting only touches project sources.
Avoid Flask CLI context teardown failures during test execution by running the suite through unittest discovery after route validation.
Reintroduce legacy spreadsheet parser entrypoints and missing import-diff helpers, then align import telemetry and test expectations with the refactored parser and stricter CRE validation paths.
Use the active interpreter and repository-relative cwd so the subprocess test works in CI runners and local environments.
Apply Black formatting across the modified Python files so Super-Linter PYTHON_BLACK passes consistently in CI.
Drop checkpoint-only scripts, helpers, and tests tied to a retired import validation flow so CI no longer carries dead maintenance surface.
Run a separate frontend lint job in GitHub Actions using prettier --check on frontend source files so style regressions are caught in CI.
Apply consistent prettier formatting in modified frontend components and routes to reduce churn and keep the branch style-clean.
Handle transient npm registry 502 failures in CI by retrying frontend dependency installation with timeout and backoff.
@northdpole northdpole merged commit 175e883 into main Apr 20, 2026
7 checks passed
@northdpole northdpole deleted the cre-import-maturity-spyros-only branch April 20, 2026 18:23