Add CheXpert Plus dataset, RadiologyKGExtraction task, and ReXKG model #1009
Sterilistic wants to merge 9 commits into sunlabuiuc:master
Implements the full ReXKG pipeline as a PyHealth contribution:
- pyhealth/datasets/chexpert_plus.py: CheXpertPlusDataset (BaseDataset)
Maps path_to_image as patient_id; exposes section_findings and related
fields for downstream NLP/KG extraction tasks.
- pyhealth/datasets/configs/chexpert_plus.yaml: YAML table config for
CheXpertPlusDataset (table: chexpert_plus).
- pyhealth/tasks/rexkg_extraction.py: RadiologyKGExtractionTask (BaseTask)
Produces {text, entities, relations} samples from CheXpert Plus reports.
Supports findings-only or findings+impression modes.
- pyhealth/models/rexkg.py: ReXKGModel (BaseModel)
PURE-based span NER head + pairwise relation extraction head on a shared
BERT encoder. Includes build_kg() for end-to-end KG construction and
save_kg() for JSON serialisation.
- tests/test_chexpert_plus.py, tests/test_rexkg.py: unit tests for all
three components.
Paper: ReXKG (https://arxiv.org/abs/2408.14397)
Dataset: CheXpert Plus (https://arxiv.org/abs/2405.19111)
Authors: Aaron Miller (aaronm6), Kathryn Thompson (kyt3), Pushpendra Tiwari (pkt3)
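For readers unfamiliar with the PURE-style approach mentioned above, the following is a minimal sketch of the two enumeration steps it relies on: listing every candidate token span up to a maximum length for the NER head, then listing ordered pairs of predicted entity spans for the relation head. The function names `enumerate_spans` and `enumerate_pairs` are hypothetical and are not taken from `pyhealth/models/rexkg.py`; the actual model scores these candidates with BERT-based classifiers, which this sketch leaves out.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def enumerate_spans(tokens: List[str], max_span_length: int) -> List[Span]:
    """Return every token span containing at most max_span_length tokens."""
    spans: List[Span] = []
    for start in range(len(tokens)):
        # The span may not run past the end of the sentence.
        stop = min(start + max_span_length, len(tokens))
        for end in range(start, stop):
            spans.append((start, end))
    return spans


def enumerate_pairs(entity_spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return all ordered (subject, object) pairs of distinct entity spans."""
    return [(s, o) for s in entity_spans for o in entity_spans if s != o]


# Illustrative input only; not taken from CheXpert Plus.
tokens = "small left pleural effusion".split()
spans = enumerate_spans(tokens, max_span_length=2)

# Pretend the NER head kept two of the candidate spans as entities.
pairs = enumerate_pairs([(0, 0), (2, 3)])
```

The number of candidate spans grows roughly with sentence length times `max_span_length`, and the number of pairs grows quadratically with the number of predicted entities, so a small `max_span_length` keeps inference cheap.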
Pull request overview
This PR adds radiology knowledge-graph extraction support to PyHealth by introducing the CheXpert Plus dataset wrapper, a RadiologyKGExtraction task that yields report text for extraction, and a ReXKGModel that performs span-based NER, pairwise relation extraction, and KG construction utilities.
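As a rough illustration of what "yields report text for extraction" could look like, here is a sketch that turns one report row into a sample dictionary, with a switch between findings-only and findings+impression modes. The helper `build_sample` is hypothetical, the `section_impression` column name is an assumption, and the empty `entities`/`relations` lists are assumed placeholders to be filled by a model later; none of this is copied from `pyhealth/tasks/rexkg_extraction.py`.

```python
from typing import Dict, Optional


def build_sample(row: Dict[str, str], include_impression: bool = False) -> Optional[dict]:
    """Assemble one extraction sample from a report row (hypothetical helper)."""
    parts = [(row.get("section_findings") or "").strip()]
    if include_impression:
        # Assumed column name for the impression section.
        parts.append((row.get("section_impression") or "").strip())
    text = " ".join(p for p in parts if p)
    if not text:
        # Skip reports with no usable text.
        return None
    return {
        "patient_id": row["path_to_image"],  # the dataset maps path_to_image to patient_id
        "text": text,
        "entities": [],   # assumed placeholder, filled by the extraction model
        "relations": [],  # assumed placeholder, filled by the extraction model
    }


# Illustrative row only; not real CheXpert Plus data.
row = {
    "path_to_image": "train/p1/s1/view1.jpg",
    "section_findings": "No effusion.",
    "section_impression": "Normal.",
}
findings_only = build_sample(row)
with_impression = build_sample(row, include_impression=True)
```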
Changes:
- Added `CheXpertPlusDataset` + YAML config to load CheXpert Plus report fields via the BaseDataset table pipeline.
- Added `RadiologyKGExtractionTask` to generate `{patient_id, text, entities, relations}` samples for KG extraction workflows.
- Added `ReXKGModel` with NER/RE inference helpers plus `build_kg()`/`save_kg()` and corresponding tests.
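To make the KG construction and JSON serialisation steps concrete, here is a minimal sketch that aggregates extracted (head, relation, tail) triples into node and edge counts and writes the result to a JSON file. The names `aggregate_triples` and `write_kg_json`, the output layout, and the relation labels are all assumptions for illustration; they are stand-ins for, not copies of, the PR's `build_kg()` and `save_kg()`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation label, tail entity)


def aggregate_triples(triples: Iterable[Triple]) -> dict:
    """Merge triples into a graph with occurrence counts per node and edge."""
    node_counts: Counter = Counter()
    edge_counts: Counter = Counter()
    for head, relation, tail in triples:
        # Light normalisation so "Effusion" and "effusion" share one node.
        head, tail = head.strip().lower(), tail.strip().lower()
        node_counts[head] += 1
        node_counts[tail] += 1
        edge_counts[(head, relation, tail)] += 1
    return {
        "nodes": [{"id": n, "count": c} for n, c in sorted(node_counts.items())],
        "edges": [
            {"head": h, "relation": r, "tail": t, "count": c}
            for (h, r, t), c in sorted(edge_counts.items())
        ],
    }


def write_kg_json(kg: dict, path: Path) -> None:
    """Serialise the graph dictionary as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(kg, f, indent=2)


# Hypothetical triples and relation labels, for illustration only.
triples = [
    ("Effusion", "located_at", "pleura"),
    ("effusion", "located_at", "pleura"),
    ("effusion", "suggestive_of", "heart failure"),
]
kg = aggregate_triples(triples)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "kg.json"
    write_kg_json(kg, out_path)
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Keeping counts on nodes and edges makes it straightforward to filter the graph later, for example down to its most frequent entities before plotting.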
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `pyhealth/datasets/chexpert_plus.py` | New dataset class for CheXpert Plus, including CSV presence verification and default task wiring. |
| `pyhealth/datasets/configs/chexpert_plus.yaml` | Dataset table config defining file path, patient_id column, and extracted text fields. |
| `pyhealth/tasks/rexkg_extraction.py` | New task to build extraction samples from CheXpert Plus events. |
| `pyhealth/tasks/__init__.py` | Exposes RadiologyKGExtractionTask from the tasks package. |
| `pyhealth/models/rexkg.py` | New ReXKG extraction model with NER/RE heads and KG builder/serializer. |
| `pyhealth/models/__init__.py` | Exposes ReXKGModel from the models package. |
| `pyhealth/datasets/__init__.py` | Exposes CheXpertPlusDataset from the datasets package. |
| `tests/test_chexpert_plus.py` | Adds dataset smoke tests (init, default task, ids, stats). |
| `tests/test_rexkg.py` | Adds model/task tests for label maps, inference outputs, KG builder, and save. |
Thank you guys)) But I am not Kathryn Thompson.
- pyhealth/tasks/rexkg_extraction.py:
* Fix AttributeError: use event['key'] bracket access instead of event.get()
* Fix wrong key prefix: 'chexpert_plus/section_findings' -> 'section_findings'
* Fix schema processor types: 'str'/'sequence' -> 'raw'
- examples/cxr/chexpert_plus_rexkg.ipynb:
* Migrate openai v0 API (ChatCompletion.create, global api_key) to v1
(openai.OpenAI client, client.chat.completions.create, attribute access)
* Load pretrained BERT encoder from src/ner/result/run_entity/ checkpoint
with key remapping (bert.* -> encoder.*), strict=False
* Limit BERT demo to 2 reports with max_span_length=4 to prevent hang
* Trim KG visualization to top-20 most-connected nodes
* Clear cell outputs; redact hardcoded API key
* Add rexkg_cache/.gitignore to exclude runtime artifacts
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
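The checkpoint key remapping mentioned in the commit above (`bert.*` -> `encoder.*`) can be sketched as a plain dictionary transformation. This is an illustrative sketch, not the notebook's code: the helper name `remap_prefix` is hypothetical, and integers stand in for tensors so the example runs without PyTorch.

```python
from typing import Dict, TypeVar

V = TypeVar("V")


def remap_prefix(state_dict: Dict[str, V], old: str = "bert.", new: str = "encoder.") -> Dict[str, V]:
    """Return a copy of state_dict with a leading `old` prefix replaced by `new`."""
    remapped: Dict[str, V] = {}
    for key, value in state_dict.items():
        if key.startswith(old):
            key = new + key[len(old):]
        remapped[key] = value
    return remapped


# Stand-in checkpoint: real values would be tensors loaded from disk.
checkpoint = {
    "bert.embeddings.word_embeddings.weight": 0,
    "ner_classifier.weight": 1,
}
remapped = remap_prefix(checkpoint)

# With PyTorch, the remapped dict would then be loaded along these lines:
#   result = model.load_state_dict(remapped, strict=False)
# strict=False reports missing and unexpected keys instead of raising,
# so heads absent from the checkpoint keep their fresh initialisation.
```

Because `strict=False` silences key mismatches, it is worth inspecting the returned missing and unexpected keys to confirm that the encoder weights were actually loaded.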
Checked results and implementation with the team. Looks good.
Reviewed code and PR request.
@joshuasteier thanks for adding yourself as a reviewer, would appreciate your review when you have time, ideally before the deadline. Thanks!
This PR introduces end-to-end support for radiology knowledge graph extraction in PyHealth via integration of the CheXpert Plus dataset and the ReXKG pipeline. It adds a new dataset, task, and model, enabling structured knowledge extraction from chest X-ray reports.
Key Features
1. CheXpert Plus Dataset Integration
- `CheXpertPlusDataset(BaseDataset)`
- `path_to_image` → `patient_id`
- `section_findings`, `impression`, and related fields for downstream tasks
- `pyhealth/datasets/configs/chexpert_plus.yaml`
- `pyhealth.datasets`
2. Radiology Knowledge Graph Extraction Task
- `RadiologyKGExtractionTask(BaseTask)`
- `{text, entities, relations}` format
- `pyhealth.tasks`
3. ReXKG Model Integration
- `ReXKGModel(BaseModel)`
- `build_kg()` for KG generation
- `save_kg()` for JSON serialization
- `pyhealth.models`
Testing
- `tests/test_chexpert_plus.py`
- `tests/test_rexkg.py`
Files Added
- `pyhealth/datasets/chexpert_plus.py`
- `pyhealth/datasets/configs/chexpert_plus.yaml`
- `pyhealth/tasks/rexkg_extraction.py`
- `pyhealth/models/rexkg.py`
- `tests/test_chexpert_plus.py`
- `tests/test_rexkg.py`
References
Authors