Add CheXpert Plus dataset, RadiologyKGExtraction task, and ReXKG model #1009

Open
Sterilistic wants to merge 9 commits into sunlabuiuc:master from Sterilistic:feature/rexkg-chexpert-plus

@Sterilistic Sterilistic commented Apr 18, 2026

This PR introduces end-to-end support for radiology knowledge graph extraction in PyHealth via integration of the CheXpert Plus dataset and the ReXKG pipeline. It adds a new dataset, task, and model, enabling structured knowledge extraction from chest X-ray reports.

Key Features

1. CheXpert Plus Dataset Integration

  • Added CheXpertPlusDataset (BaseDataset)
    • Supports loading and validation of CheXpert Plus chest X-ray reports
    • Maps path_to_image → patient_id
    • Exposes section_findings, impression, and related fields for downstream tasks
  • Added dataset config:
    • pyhealth/datasets/configs/chexpert_plus.yaml
  • Registered dataset in pyhealth.datasets
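
As a rough illustration of the mapping described above, here is a minimal standalone sketch that does not use PyHealth. It assumes a CSV with `path_to_image`, `section_findings`, and `impression` columns (names taken from this description); the sample rows and the `load_reports` helper are hypothetical and are not the code in this PR.

```python
import csv
import io
from collections import defaultdict

# Made-up two-row excerpt for illustration; the real CSV has many more columns.
SAMPLE_CSV = """path_to_image,section_findings,impression
train/patient00001/study1/view1_frontal.jpg,Lungs are clear.,No acute disease.
train/patient00002/study1/view1_frontal.jpg,Small left pleural effusion.,Left effusion.
"""


def load_reports(csv_text):
    """Group report rows by path_to_image, which serves as the patient key."""
    patients = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        patients[row["path_to_image"]].append(
            {
                "section_findings": row["section_findings"],
                "impression": row["impression"],
            }
        )
    return dict(patients)


reports = load_reports(SAMPLE_CSV)
```

Because the image path is the key, each image ends up as its own "patient" in this sketch, which matches the mapping stated above but may not match a per-person grouping.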

2. Radiology Knowledge Graph Extraction Task

  • Added RadiologyKGExtractionTask (BaseTask)
    • Converts reports into {text, entities, relations} format
    • Supports:
      • Findings-only mode
      • Findings + impression mode
  • Registered in pyhealth.tasks
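
The two modes listed above can be sketched roughly as follows. This is a standalone approximation, not the task class from this PR: the `make_sample` helper, its `include_impression` flag, and the choice to start with empty `entities` and `relations` lists are all assumptions made for illustration.

```python
def make_sample(patient_id, findings, impression="", include_impression=False):
    """Build one extraction sample from a report.

    Assumption: entities and relations start empty and are filled in later
    by the extraction model.
    """
    parts = [findings.strip()]
    if include_impression and impression.strip():
        parts.append(impression.strip())
    text = " ".join(p for p in parts if p)
    if not text:
        return None  # skip reports with no usable text
    return {
        "patient_id": patient_id,
        "text": text,
        "entities": [],
        "relations": [],
    }


findings_only = make_sample("p1", "Lungs are clear.", "No acute disease.")
combined = make_sample(
    "p1", "Lungs are clear.", "No acute disease.", include_impression=True
)
```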

3. ReXKG Model Integration

  • Added ReXKGModel (BaseModel)
    • Shared BERT encoder
    • PURE-style span-based NER head
    • Pairwise relation extraction head
  • Capabilities:
    • Named Entity Recognition (NER)
    • Relation Extraction
    • End-to-end Knowledge Graph construction
  • Utilities:
    • build_kg() for KG generation
    • save_kg() for JSON serialization
  • Registered in pyhealth.models
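
For readers unfamiliar with span-based extraction, the candidate generation behind the two heads can be sketched in plain Python. This is only an illustration of the general idea (score every short span for NER, then score every ordered pair of predicted entities for relations); the function names and the `max_span_length` default are hypothetical and do not reflect the ReXKGModel internals.

```python
def enumerate_spans(tokens, max_span_length=4):
    """Return every inclusive (start, end) token span of up to max_span_length tokens."""
    spans = []
    for start in range(len(tokens)):
        last = min(start + max_span_length, len(tokens))
        for end in range(start, last):
            spans.append((start, end))
    return spans


def candidate_pairs(entity_spans):
    """Return ordered (subject, object) pairs of distinct entity spans."""
    return [(s, o) for s in entity_spans for o in entity_spans if s != o]


tokens = ["small", "left", "pleural", "effusion"]
spans = enumerate_spans(tokens, max_span_length=2)
pairs = candidate_pairs([(0, 0), (2, 3)])
```

The number of candidate spans grows with both sentence length and `max_span_length`, and the number of pairs grows quadratically with the number of predicted entities, so capping the span length keeps inference cost manageable.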

Testing

  • tests/test_chexpert_plus.py
    • Dataset initialization
    • Patient ID parsing
    • Dataset statistics validation
  • tests/test_rexkg.py
    • NER + relation label mapping
    • Prediction outputs
    • Knowledge graph construction

Files Added

  • pyhealth/datasets/chexpert_plus.py
  • pyhealth/datasets/configs/chexpert_plus.yaml
  • pyhealth/tasks/rexkg_extraction.py
  • pyhealth/models/rexkg.py
  • tests/test_chexpert_plus.py
  • tests/test_rexkg.py

References

Authors

Implements the full ReXKG pipeline as a PyHealth contribution:

- pyhealth/datasets/chexpert_plus.py: CheXpertPlusDataset (BaseDataset)
  Maps path_to_image as patient_id; exposes section_findings and related
  fields for downstream NLP/KG extraction tasks.

- pyhealth/datasets/configs/chexpert_plus.yaml: YAML table config for
  CheXpertPlusDataset (table: chexpert_plus).

- pyhealth/tasks/rexkg_extraction.py: RadiologyKGExtractionTask (BaseTask)
  Produces {text, entities, relations} samples from CheXpert Plus reports.
  Supports findings-only or findings+impression modes.

- pyhealth/models/rexkg.py: ReXKGModel (BaseModel)
  PURE-based span NER head + pairwise relation extraction head on a shared
  BERT encoder. Includes build_kg() for end-to-end KG construction and
  save_kg() for JSON serialisation.

- tests/test_chexpert_plus.py, tests/test_rexkg.py: unit tests for all
  three components.
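
To make the KG construction and JSON serialisation steps concrete, here is a minimal sketch of how per-report predictions might be aggregated. The `build_kg_sketch` and `save_kg_sketch` helpers, the sample schema, and the entity and relation type names are illustrative assumptions, not the signatures of build_kg() or save_kg() in this PR.

```python
import json
from collections import Counter


def build_kg_sketch(samples):
    """Aggregate per-report entities and relations into node and edge counts."""
    nodes, edges = Counter(), Counter()
    for sample in samples:
        for ent in sample["entities"]:
            nodes[(ent["text"].lower(), ent["type"])] += 1
        for rel in sample["relations"]:
            edges[(rel["head"].lower(), rel["type"], rel["tail"].lower())] += 1
    return {
        "nodes": [
            {"name": name, "type": etype, "count": count}
            for (name, etype), count in sorted(nodes.items())
        ],
        "edges": [
            {"head": head, "relation": rtype, "tail": tail, "count": count}
            for (head, rtype, tail), count in sorted(edges.items())
        ],
    }


def save_kg_sketch(kg, path):
    """Write the aggregated graph to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(kg, f, indent=2)


# Made-up predictions for two reports.
samples = [
    {
        "entities": [
            {"text": "Effusion", "type": "disorder"},
            {"text": "pleural", "type": "anatomy"},
        ],
        "relations": [
            {"head": "Effusion", "type": "located_at", "tail": "pleural"}
        ],
    },
    {"entities": [{"text": "effusion", "type": "disorder"}], "relations": []},
]
kg = build_kg_sketch(samples)
```

Lower-casing entity text before counting merges surface variants such as "Effusion" and "effusion" into one node; a real pipeline may normalise more aggressively.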

Paper: ReXKG (https://arxiv.org/abs/2408.14397)
Dataset: CheXpert Plus (https://arxiv.org/abs/2405.19111)

Authors: Aaron Miller (aaronm6), Kathryn Thompson (kyt3), Pushpendra Tiwari (pkt3)
Copilot AI review requested due to automatic review settings April 18, 2026 20:12

Copilot AI left a comment

Pull request overview

This PR adds radiology knowledge-graph extraction support to PyHealth by introducing the CheXpert Plus dataset wrapper, a RadiologyKGExtraction task that yields report text for extraction, and a ReXKGModel that performs span-based NER, pairwise relation extraction, and KG construction utilities.

Changes:

  • Added CheXpertPlusDataset + YAML config to load CheXpert Plus report fields via the BaseDataset table pipeline.
  • Added RadiologyKGExtractionTask to generate {patient_id, text, entities, relations} samples for KG extraction workflows.
  • Added ReXKGModel with NER/RE inference helpers plus build_kg()/save_kg() and corresponding tests.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `pyhealth/datasets/chexpert_plus.py` | New dataset class for CheXpert Plus, including CSV presence verification and default task wiring. |
| `pyhealth/datasets/configs/chexpert_plus.yaml` | Dataset table config defining file path, patient_id column, and extracted text fields. |
| `pyhealth/tasks/rexkg_extraction.py` | New task to build extraction samples from CheXpert Plus events. |
| `pyhealth/tasks/__init__.py` | Exposes RadiologyKGExtractionTask from the tasks package. |
| `pyhealth/models/rexkg.py` | New ReXKG extraction model with NER/RE heads and KG builder/serializer. |
| `pyhealth/models/__init__.py` | Exposes ReXKGModel from the models package. |
| `pyhealth/datasets/__init__.py` | Exposes CheXpertPlusDataset from the datasets package. |
| `tests/test_chexpert_plus.py` | Adds dataset smoke tests (init, default task, ids, stats). |
| `tests/test_rexkg.py` | Adds model/task tests for label maps, inference outputs, KG builder, and save. |

Comment thread pyhealth/models/rexkg.py
Comment thread pyhealth/models/rexkg.py Outdated
Comment thread pyhealth/models/rexkg.py
Comment thread pyhealth/models/rexkg.py
Comment thread pyhealth/models/rexkg.py Outdated
Comment thread pyhealth/models/rexkg.py Outdated
Comment thread pyhealth/tasks/rexkg_extraction.py
Comment thread pyhealth/datasets/chexpert_plus.py
Comment thread tests/test_rexkg.py
Comment thread tests/test_rexkg.py Outdated

kyt3 commented Apr 19, 2026

Thank you guys)) But I am not Kathryn Thompson.

@joshuasteier joshuasteier self-requested a review April 19, 2026 15:27
@Sterilistic Sterilistic marked this pull request as draft April 19, 2026 18:21
Sterilistic and others added 4 commits April 19, 2026 12:52
- pyhealth/tasks/rexkg_extraction.py:
  * Fix AttributeError: use event['key'] bracket access instead of event.get()
  * Fix wrong key prefix: 'chexpert_plus/section_findings' -> 'section_findings'
  * Fix schema processor types: 'str'/'sequence' -> 'raw'

- examples/cxr/chexpert_plus_rexkg.ipynb:
  * Migrate openai v0 API (ChatCompletion.create, global api_key) to v1
    (openai.OpenAI client, client.chat.completions.create, attribute access)
  * Load pretrained BERT encoder from src/ner/result/run_entity/ checkpoint
    with key remapping (bert.* -> encoder.*), strict=False
  * Limit BERT demo to 2 reports with max_span_length=4 to prevent hang
  * Trim KG visualization to top-20 most-connected nodes
  * Clear cell outputs; redact hardcoded API key
  * Add rexkg_cache/.gitignore to exclude runtime artifacts
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
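
Two of the notebook changes in the commit above are easy to illustrate without any deep learning dependencies: the checkpoint key remapping (bert.* -> encoder.*) and trimming the visualization to the most-connected nodes. The sketch below is an approximation under stated assumptions; the helper names, the example keys, and the edge format are hypothetical, and the notebook's actual code may differ.

```python
from collections import Counter


def remap_checkpoint_keys(state_dict, old_prefix="bert.", new_prefix="encoder."):
    """Rename keys that start with old_prefix; leave all other keys unchanged."""
    return {
        (new_prefix + key[len(old_prefix):]) if key.startswith(old_prefix) else key: value
        for key, value in state_dict.items()
    }


def top_connected_nodes(edges, k=20):
    """Return the k nodes with the highest degree from (head, tail) edge pairs."""
    degree = Counter()
    for head, tail in edges:
        degree[head] += 1
        degree[tail] += 1
    return [node for node, _ in degree.most_common(k)]


# Placeholder values stand in for tensors.
remapped = remap_checkpoint_keys(
    {"bert.embeddings.weight": 1, "ner_classifier.weight": 2}
)
# With PyTorch, the remapped dict would then be passed to
# model.load_state_dict(remapped, strict=False) so that keys without a
# counterpart in the target model are tolerated.

hubs = top_connected_nodes(
    [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")], k=1
)
```

Note that loading with strict=False silently skips mismatched keys, so it is worth inspecting the missing and unexpected keys it reports to confirm the encoder weights were actually loaded.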
@Sterilistic Sterilistic marked this pull request as ready for review April 19, 2026 22:22
@StrawHatAaron

Checked results and implementation with the team. Looks good.

kyt325 commented Apr 19, 2026

Reviewed code and PR request.

Sterilistic commented Apr 19, 2026

@joshuasteier thanks for adding yourself as a reviewer. I would appreciate your review when you have time, ideally before the deadline. Thanks!
