Generative Large Language Models Trained for Detecting Errors in Radiology Reports
- URL: http://arxiv.org/abs/2504.04336v1
- Date: Sun, 06 Apr 2025 03:02:36 GMT
- Title: Generative Large Language Models Trained for Detecting Errors in Radiology Reports
- Authors: Cong Sun, Kurt Teichman, Yiliang Zhou, Brian Critelli, David Nauheim, Graham Keir, Xindi Wang, Judy Zhong, Adam E Flanders, George Shih, Yifan Peng
- Abstract summary: This dataset includes 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts. Several models, including Llama-3, GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall.
- Score: 11.852981889270012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this retrospective study, a dataset was constructed with two parts. The first part included 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts, with 828 being error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports from 2011 to 2016 from the MIMIC-CXR database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Then, several models, including Llama-3, GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated using the F1 score, 95% confidence intervals (CIs), and paired-sample t-tests on our constructed dataset, with the prediction results further assessed by radiologists. Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model. Of these, 99 were confirmed by both radiologists to contain model-detected errors, and 163 were confirmed by at least one radiologist. Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.
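A minimal sketch of the evaluation described above, under stated assumptions: per-category F1 with a percentile-bootstrap 95% CI. The toy labels and the bootstrap procedure are illustrative stand-ins, not the authors' code.

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_with_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Point-estimate F1 plus a percentile-bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip degenerate resamples containing a single class
        boots.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)

# Toy binary labels for one error category (1 = report contains the error).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
f1, (lo, hi) = f1_with_ci(y_true, y_pred)
print(f"negation errors: F1={f1:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```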
Related papers
- TGEA: An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pretrained Language Models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs). We create an error taxonomy covering 24 types of errors occurring in PLM-generated sentences. This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z) - Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation [20.173287130474797]
Generative medical Vision Large Language Models (VLLMs) are prone to hallucinations and can produce inaccurate diagnostic information. We introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties. Our approach improves factuality scores by 10%, achieved by rejecting 20% of reports, using the Radialog model on the MIMIC-CXR dataset.
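A minimal sketch of the consistency idea only: score a study by the mean pairwise similarity of several stochastic generations and reject the least consistent ones. TF-IDF cosine similarity stands in for the paper's semantic measure, and the threshold is an invented placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consistency_score(samples):
    """Mean pairwise cosine similarity among sampled reports (diagonal excluded)."""
    tfidf = TfidfVectorizer().fit_transform(samples)
    sim = cosine_similarity(tfidf)
    n = len(samples)
    return (sim.sum() - n) / (n * (n - 1))

# Toy example: three stochastic generations for one study.
samples = [
    "No focal consolidation. Mild cardiomegaly.",
    "Mild cardiomegaly without focal consolidation.",
    "Large right pleural effusion.",  # inconsistent with the other two
]
score = consistency_score(samples)
print(f"consistency={score:.3f}, reject={score < 0.3}")  # placeholder threshold
```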
arXiv Detail & Related papers (2024-12-05T20:43:39Z) - Anatomically-Grounded Fact Checking of Automated Chest X-ray Reports [0.0]
We propose a novel model for explainable fact-checking that identifies errors in findings and their locations within the reports. We evaluate the resulting fact-checking model and its utility in correcting reports generated by several SOTA automated reporting tools.
arXiv Detail & Related papers (2024-12-03T05:21:42Z) - CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE).
CRTRE applies randomized trial design principles to estimate the causal effect of association rules.
We then incorporate these association rules into downstream applications such as disease onset prediction.
arXiv Detail & Related papers (2024-11-10T02:40:06Z) - ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports [1.9106067578277455]
We introduce ReXErr, a methodology that leverages Large Language Models to generate representative errors within chest X-ray reports.
We developed error categories that capture common mistakes in both human and AI-generated reports.
Our approach uses a novel sampling scheme to inject diverse errors while maintaining clinical plausibility.
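A simplified sketch of error injection in the spirit of ReXErr; the paper uses LLM-driven generation with a clinical-plausibility sampling scheme, so these regex-based swaps are only a crude stand-in.

```python
import random
import re

def inject_left_right(text):
    # Swap the first "left"/"right" token (a left/right error).
    return re.sub(r"\b(left|right)\b",
                  lambda m: "right" if m.group(1) == "left" else "left",
                  text, count=1)

def inject_negation(text):
    # Drop a leading "No " so a negative finding reads as positive (a negation error).
    return re.sub(r"\bNo\s+", "", text, count=1)

INJECTORS = {"left/right": inject_left_right, "negation": inject_negation}

random.seed(0)
report = "No pneumothorax. Small left pleural effusion."
category = random.choice(list(INJECTORS))  # sample an error category
print(category, "->", INJECTORS[category](report))
```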
arXiv Detail & Related papers (2024-09-17T01:42:39Z) - Exploring Multimodal Large Language Models for Radiology Report Error-checking [1.7217842380976978]
This paper proposes one of the first clinical applications of multimodal large language models (LLMs) as an assistant for radiologists to check errors in their reports.
We created an evaluation dataset from real-world radiology datasets (including X-rays and CT scans).
At the SIMPLE level, our fine-tuned model significantly enhanced performance by 47.4% and 25.4% on MIMIC-CXR and IU X-ray data, respectively.
arXiv Detail & Related papers (2023-12-20T15:20:33Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
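A minimal sketch of the DetectGPT criterion: score a passage by its log probability under the model minus the mean log probability of perturbed rewrites; machine-generated text tends to score higher. The paper perturbs with T5 mask-filling; random word dropout below is a crude stand-in, and GPT-2 stands in for the scoring model.

```python
import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text):
    # Approximate total log probability of the passage under the model.
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token NLL
    return -loss.item() * ids.shape[1]

def perturb(text, drop=0.15):
    # Crude perturbation: random word dropout (the paper uses T5 mask-filling).
    kept = [w for w in text.split() if random.random() > drop]
    return " ".join(kept) if kept else text

def detectgpt_score(text, n_perturb=10):
    base = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturb)]
    return base - sum(perturbed) / len(perturbed)

print(detectgpt_score("The quick brown fox jumps over the lazy dog."))
```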
arXiv Detail & Related papers (2023-01-26T18:44:06Z) - Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare the performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z) - Event-based clinical findings extraction from radiology reports with pre-trained language model [0.22940141855172028]
We present a new corpus of radiology reports annotated with clinical findings.
The gold standard corpus contained a total of 500 annotated computed tomography (CT) reports.
We extracted triggers and argument entities using two state-of-the-art deep learning architectures, including BERT.
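A minimal sketch of the extraction mechanics framed as token classification, the standard BERT recipe the summary alludes to. The checkpoint below is a generic English NER model used only to show the API, not the authors' clinical model.

```python
from transformers import pipeline

# A clinical token-classification model trained on the annotated CT corpus
# would be used in practice; this generic NER checkpoint shows the mechanics.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

sentence = "CT reviewed by Dr. Smith at Mount Sinai Hospital."
for ent in ner(sentence):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```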
arXiv Detail & Related papers (2021-12-27T05:03:10Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the question: Have we reached a performance ceiling, or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z) - CLARA: Clinical Report Auto-completion [56.206459591367405]
CLinicAl Report Auto-completion (CLARA) is an interactive method that generates reports in a sentence-by-sentence fashion based on doctors' anchor words and partially completed sentences.
In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports, and 0.482 CIDEr and 0.491 BLEU-4 on EEG reports, for sentence-level generation.
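For reference, a sentence-level BLEU-4 figure like the ones above can be computed as follows; NLTK's implementation stands in for whatever toolkit the authors used, and the example sentences are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["no acute cardiopulmonary process".split()]
hypothesis = "no acute cardiopulmonary abnormality".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu4 = sentence_bleu(reference, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4 = {bleu4:.3f}")
```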
arXiv Detail & Related papers (2020-02-26T18:45:00Z)