CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction
- URL: http://arxiv.org/abs/2505.12057v1
- Date: Sat, 17 May 2025 15:39:39 GMT
- Title: CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction
- Authors: Jing Zou, Qingqiu Li, Chenyu Lian, Lihao Liu, Xiaohan Yan, Shujun Wang, Jing Qin,
- Abstract summary: CorBenchX is a suite for automated error detection and correction in chest X-ray reports.<n>We first synthesize a large-scale dataset of 26,326 chest X-ray error reports.<n>We benchmark both open- and closed-source vision-language models.
- Score: 11.731590131260424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and further correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models,(e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6 % detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbertF1 0.954, remaining below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. We apply MSRL to QwenVL2.5-7B, the top open-source model in our benchmark, achieving an improvement of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.
Related papers
- Generative Large Language Models Trained for Detecting Errors in Radiology Reports [11.852981889270012]
This dataset includes 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts.<n>Several models, including Llama-3, GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies.<n>Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall.
arXiv Detail & Related papers (2025-04-06T03:02:36Z) - Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model [1.7064514726335305]
We analyzed 9,683 Hebrew radiology reports from Crohn's disease patients.<n>We incorporated uncertainty-aware prompt ensembles and an agent-based decision model.
arXiv Detail & Related papers (2025-02-02T16:57:03Z) - Anatomically-Grounded Fact Checking of Automated Chest X-ray Reports [0.0]
We propose a novel model for explainable fact-checking that identifies errors in findings and their locations indicated through the reports.<n>We evaluate the resulting fact-checking model and its utility in correcting reports generated by several SOTA automated reporting tools.
arXiv Detail & Related papers (2024-12-03T05:21:42Z) - Exploring Multimodal Large Language Models for Radiology Report
Error-checking [1.7217842380976978]
This paper proposes one of the first clinical applications of multimodal large language models (LLMs) as an assistant for radiologists to check errors in their reports.
We created an evaluation dataset from real-world radiology datasets (including X-rays and CT scans)
At the SIMPLE level, our fine-tuned model significantly enhanced performance by 47.4% and 25.4% on MIMIC-CXR and IU X-ray data, respectively.
arXiv Detail & Related papers (2023-12-20T15:20:33Z) - Proximity-Informed Calibration for Deep Neural Networks [49.330703634912915]
ProCal is a plug-and-play algorithm with a theoretical guarantee to adjust sample confidence based on proximity.
We show that ProCal is effective in addressing proximity bias and improving calibration on balanced, long-tail, and distribution-shift settings.
arXiv Detail & Related papers (2023-06-07T16:40:51Z) - Learning to diagnose common thorax diseases on chest radiographs from
radiology reports in Vietnamese [0.33598755777055367]
We propose a data collecting and annotation pipeline that extracts information from Vietnamese radiology reports to provide accurate labels for chest X-ray (CXR) images.
This can benefit Vietnamese radiologists and clinicians by annotating data that closely match their endemic diagnosis categories which may vary from country to country.
arXiv Detail & Related papers (2022-09-11T06:06:03Z) - Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Conal Neural Networks (CNNs) via an error simulation engine.
These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults.
We show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t.FI, that only implements a limited set of error models.
arXiv Detail & Related papers (2022-06-04T19:45:02Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for
Thoracic Disease Identification [83.6017225363714]
deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - Chest x-ray automated triage: a semiologic approach designed for
clinical implementation, exploiting different types of labels through a
combination of four Deep Learning architectures [83.48996461770017]
This work presents a Deep Learning method based on the late fusion of different convolutional architectures.
We built four training datasets combining images from public chest x-ray datasets and our institutional archive.
We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool.
arXiv Detail & Related papers (2020-12-23T14:38:35Z) - Collaborative Boundary-aware Context Encoding Networks for Error Map
Prediction [65.44752447868626]
We propose collaborative boundaryaware context encoding networks called AEP-Net for error prediction task.
Specifically, we propose a collaborative feature transformation branch for better feature fusion between images and masks, and precise localization of error regions.
The AEP-Net achieves an average DSC of 0.8358, 0.8164 for error prediction task, and shows a high Pearson correlation coefficient of 0.9873.
arXiv Detail & Related papers (2020-06-25T12:42:01Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE)
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.