RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)
- URL: http://arxiv.org/abs/2601.15129v1
- Date: Wed, 21 Jan 2026 16:04:01 GMT
- Title: RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)
- Authors: Yishu Wei, Adam E. Flanders, Errol Colak, John Mongan, Luciano M Prevedello, Po-Hao Chen, Henrique Min Ho Lee, Gilberto Szarf, Hamilton Shoji, Jason Sho, Katherine Andriole, Tessa Cook, Lisa C. Adams, Linda C. Chu, Maggie Chung, Geraldine Brusca-Augello, Djeven P. Deva, Navneet Singh, Felipe Sanchez Tijmes, Jeffrey B. Alpert, Elsie T. Nguyen, Drew A. Torigian, Kate Hanneman, Lauren K Groner, Alexander Phan, Ali Islam, Matias F. Callejas, Gustavo Borges da Silva Teles, Faisal Jamal, Maryam Vazirabad, Ali Tejani, Hari Trivedi, Paulo Kuriki, Rajesh Bhayana, Elana T. Benishay, Yi Lin, Yifan Peng, George Shih
- Abstract summary: GPT-4o extracted abnormal findings from 13,735 chest radiographs and their corresponding reports from the MIDRC. 1,000 studies were sampled on the basis of the AI-suggested benchmark labels for expert review. Seventeen chest radiologists participated, marking "Agree all", "Agree mostly", or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels.
- Score: 27.699008063855146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models have demonstrated performance comparable to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. The aim of this study was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, marking "Agree all", "Agree mostly", or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected "Agree all" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
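The consensus step described above (each study rated by three radiologists, with studies advancing only when at least two rate "Agree all") can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the study identifiers and function names are hypothetical.

```python
from collections import Counter

def consensus_keep(responses, min_agree_all=2):
    """Return True if enough experts fully endorsed the AI-suggested labels.

    responses: the three expert ratings for one study, each one of
    "Agree all", "Agree mostly", or "Disagree".
    """
    counts = Counter(responses)
    return counts["Agree all"] >= min_agree_all

# Hypothetical review records: each study was rated by three radiologists.
# In the paper, 381 of the 1,000 sampled studies passed this filter.
reviews = {
    "study_001": ["Agree all", "Agree all", "Agree mostly"],
    "study_002": ["Agree mostly", "Disagree", "Agree all"],
}
kept = [sid for sid, r in reviews.items() if consensus_keep(r)]
# kept == ["study_001"]
```

From the studies that pass this filter, the paper then selects 200 (prioritizing less common or multiple findings) and splits them evenly into the released and holdout sets.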
Related papers
- Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting [40.40577855417923]
We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray report generation. A novel L&T-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change, and placement. Our results suggest MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.
arXiv Detail & Related papers (2025-11-21T10:53:26Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Act Like a Radiologist: Radiology Report Generation across Anatomical Regions [50.13206214694885]
X-RGen is a radiologist-minded report generation framework across six anatomical regions.
In X-RGen, we seek to mimic the behaviour of human radiologists, breaking it down into four principal phases.
We enhance the recognition capacity of the image encoder by analysing images and reports across various regions.
arXiv Detail & Related papers (2023-05-26T07:12:35Z) - Advancing Radiograph Representation Learning with Masked Record Modeling [52.04899592688968]
We formulate the self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM)
MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations.
Specifically, we find that MRM offers superior performance in label-efficient fine-tuning.
arXiv Detail & Related papers (2023-01-30T18:33:32Z) - Auto-outlier Fusion Technique for Chest X-ray classification with Multi-head Attention Mechanism [4.416665886445889]
A chest X-ray is one of the most widely available radiological examinations for diagnosing and detecting various lung illnesses.
The National Institutes of Health (NIH) provides extensive databases, ChestX-ray8 and ChestX-ray14, to help establish a deep learning community for analysing and predicting lung diseases.
arXiv Detail & Related papers (2022-11-15T09:35:49Z) - Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two models, Show-Attend-Tell and GPT-3, to generate comprehensive and descriptive radiology records.
The proposed model is tested on two medical datasets, Open-I and MIMIC-CXR, and the general-purpose MS-COCO dataset.
arXiv Detail & Related papers (2022-09-28T10:27:10Z) - Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays [65.88435151891369]
The Radiomics-Guided Transformer (RGT) fuses global image information with local knowledge-guided radiomics information.
RGT consists of an image Transformer branch, a radiomics Transformer branch, and fusion layers that aggregate image and radiomic information.
arXiv Detail & Related papers (2022-07-10T06:32:56Z) - Robust Classification from Noisy Labels: Integrating Additional Knowledge for Chest Radiography Abnormality Assessment [14.631388658828921]
The introduction of large-scale public datasets has led to a series of novel systems for automated abnormality classification.
We propose novel training strategies that handle label noise from such suboptimal data.
With an average AUC of 0.880 across all abnormalities, our proposed training strategies significantly improve performance.
arXiv Detail & Related papers (2021-04-12T07:51:07Z) - VisualCheXbert: Addressing the Discrepancy Between Radiology Report Labels and Image Labels [4.865330207715854]
We show that radiologists labeling reports significantly disagree with radiologists labeling chest X-ray images.
We develop and evaluate methods to produce labels from radiology reports that have better agreement with radiologists labeling images.
arXiv Detail & Related papers (2021-02-23T03:02:36Z) - Automated Radiological Report Generation For Chest X-Rays With Weakly-Supervised End-to-End Deep Learning [17.315387269810426]
We built a database containing more than 12,000 CXR scans and radiological reports.
We developed a model based on deep convolutional neural network and recurrent network with attention mechanism.
The model provides automated recognition of given scans and generation of reports.
arXiv Detail & Related papers (2020-06-18T08:12:54Z) - Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes [64.21642241351857]
We curated and analyzed a chest computed tomography (CT) data set of 36,316 volumes from 19,993 unique patients.
We developed a rule-based method for automatically extracting abnormality labels from free-text radiology reports.
We also developed a model for multi-organ, multi-disease classification of chest CT volumes.
arXiv Detail & Related papers (2020-02-12T00:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.