CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats
- URL: http://arxiv.org/abs/2405.19538v2
- Date: Mon, 3 Jun 2024 19:14:12 GMT
- Title: CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats
- Authors: Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, Curtis P. Langlotz
- Abstract summary: CheXpert Plus is the largest text dataset publicly released in radiology.
It represents the largest text de-identification effort in radiology.
All reports are paired with high-quality images in DICOM format.
- Score: 18.498344743909254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with extensive image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1 Models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus
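As a rough illustration of how the paired data might be consumed, the sketch below loads one report and its DICOM image with pandas and pydicom. The CSV file name and column names (chexpert_plus_reports.csv, section_impression, dicom_path) are assumptions for illustration, not the dataset's documented schema.

```python
# Minimal sketch of pairing a CheXpert Plus report with its DICOM image.
# File name and column names below are illustrative assumptions, not the
# dataset's documented schema.
import pandas as pd
import pydicom

# Hypothetical metadata file: one row per study, with report text and an image path.
df = pd.read_csv("chexpert_plus_reports.csv")
row = df.iloc[0]

# Free-text report section (column name assumed).
print("IMPRESSION:", row["section_impression"])

# Load the paired DICOM image and access its pixel data.
ds = pydicom.dcmread(row["dicom_path"])
pixels = ds.pixel_array  # 2D numpy array of the radiograph
print("Image shape:", pixels.shape)
```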
Related papers
- ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports [4.247428746963443]
This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems.
By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis.
arXiv Detail & Related papers (2025-05-01T00:29:50Z) - UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks.
UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs.
We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
arXiv Detail & Related papers (2024-12-13T18:59:40Z) - Shadow and Light: Digitally Reconstructed Radiographs for Disease Classification [8.192975020366777]
DRR-RATE comprises 50,188 frontal Digitally Reconstructed Radiographs (DRRs) from 21,304 unique patients.
Each image is paired with a corresponding radiology text report and binary labels for 18 pathology classes.
We demonstrate the applicability of DRR-RATE alongside existing large-scale chest X-ray resources, notably the CheXpert dataset and CheXnet model.
arXiv Detail & Related papers (2024-06-06T02:19:18Z) - Revisiting Computer-Aided Tuberculosis Diagnosis [56.80999479735375]
Tuberculosis (TB) is a major global health threat, causing millions of deaths annually.
Computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data.
We establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding box annotations for TB areas.
This dataset enables the training of sophisticated detectors for high-quality CTD.
arXiv Detail & Related papers (2023-07-06T08:27:48Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - Act Like a Radiologist: Radiology Report Generation across Anatomical Regions [50.13206214694885]
X-RGen is a radiologist-minded report generation framework across six anatomical regions.
In X-RGen, we seek to mimic the behaviour of human radiologists, breaking it down into four principal phases.
We enhance the recognition capacity of the image encoder by analysing images and reports across various regions.
arXiv Detail & Related papers (2023-05-26T07:12:35Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance (a generic sketch of KNN-style prediction smoothing appears after this list).
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - Learning Invariant Feature Representation to Improve Generalization across Chest X-ray Datasets [55.06983249986729]
We show that a deep learning model that performs well when tested on data from the same source as its training set performs poorly when tested on a dataset from a different source.
By employing an adversarial training strategy, we show that a network can be forced to learn a source-invariant representation.
arXiv Detail & Related papers (2020-08-04T07:41:15Z) - COVID-19 Image Data Collection: Prospective Predictions Are the Future [12.81240882490576]
This dataset is the largest public resource for COVID-19 image and prognostic data.
It was manually aggregated from publication figures as well as various web based repositories into a machine learning friendly format.
We present multiple possible use cases for the data, such as predicting the need for ICU admission, predicting patient survival, and understanding a patient's trajectory during treatment.
arXiv Detail & Related papers (2020-06-22T03:20:36Z) - XRayGAN: Consistency-preserving Generation of X-ray Images from Radiology Reports [19.360283053558604]
We develop methods to generate view-consistent, high-fidelity, and high-resolution X-ray images from radiology reports.
This work is the first to generate consistent, high-resolution X-ray images from radiology reports.
arXiv Detail & Related papers (2020-06-17T05:32:14Z) - BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients [2.927469685126833]
This first iteration of the database includes 1,380 CX, 885 DX and 163 CT studies from 1,311 COVID-19+ patients.
This is, to the best of our knowledge, the largest COVID-19+ dataset of images available in an open format.
arXiv Detail & Related papers (2020-06-01T18:06:21Z) - Y-Net for Chest X-Ray Preprocessing: Simultaneous Classification of Geometry and Segmentation of Annotations [70.0118756144807]
This work introduces a general pre-processing step for chest X-ray input into machine learning algorithms.
A modified Y-Net architecture based on the VGG11 encoder is used to simultaneously learn geometric orientation and segmentation of radiographs.
Results were evaluated by expert clinicians, who judged geometry acceptable in 95.8% of images and annotation masks in 96.2%, compared to 27.0% and 34.9%, respectively, for control images.
arXiv Detail & Related papers (2020-05-08T02:16:17Z)
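The MODL/KNNS entry above mentions K-nearest neighbor smoothing of predictions. The sketch below shows one generic form of the idea, averaging each sample's predicted probabilities with those of its K nearest neighbors in feature space; it illustrates the general technique under assumed inputs, not the paper's exact method.

```python
# Generic sketch of K-nearest-neighbor prediction smoothing, as an
# illustration of the KNNS idea mentioned above; not the paper's exact method.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_smooth(features: np.ndarray, probs: np.ndarray, k: int = 5) -> np.ndarray:
    """Average each sample's predicted probabilities with those of its k
    nearest neighbors in feature space (Euclidean distance)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)  # +1: self is a neighbor
    _, idx = nn.kneighbors(features)                        # idx: (n, k + 1)
    return probs[idx].mean(axis=1)                          # mean over self + k neighbors

# Toy usage: 100 samples, 32-d features, 14 disease probabilities.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))
raw_probs = rng.uniform(size=(100, 14))
smoothed = knn_smooth(feats, raw_probs, k=5)
print(smoothed.shape)  # (100, 14)
```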
This list is automatically generated from the titles and abstracts of the papers on this site.