Related papers: A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection

URL: http://arxiv.org/abs/2506.20112v1
Date: Wed, 25 Jun 2025 04:02:29 GMT
Title: A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection
Authors: Songsoo Kim, Seungtae Lee, See Young Lee, Joonho Kim, Keechan Kan, Dukyong Yoon
Abstract summary: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. A three-pass LLM framework significantly enhanced PPV and reduced operational costs.
Score: 1.8604092379196109
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3's superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
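The percentages quoted in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the function names are hypothetical, the cost and report counts are copied from the abstract, and the sensitivity, specificity, and prevalence values in the PPV example are illustrative assumptions chosen only to show why low error prevalence depresses PPV.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage drop when moving from `baseline` to `new`."""
    return (baseline - new) / baseline * 100.0


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: TP / (TP + FP)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Operational cost per 1,000 reports in USD, as stated in the abstract.
COST_F1, COST_F2, COST_F3 = 9.72, 6.85, 5.58

print(round(relative_reduction(COST_F1, COST_F3), 1))  # 42.6
print(round(relative_reduction(COST_F2, COST_F3), 1))  # 18.5

# Human-reviewed reports, as stated in the abstract (192 -> 88).
# The resulting percentage is derived here; the abstract does not quote it.
print(round(relative_reduction(192, 88), 1))

# Hypothetical detector (90% sensitivity and specificity, assumed values):
# the same detector yields a much lower PPV when errors are rare.
print(ppv(0.9, 0.9, 0.01) < ppv(0.9, 0.9, 0.20))  # True
```

The two cost reductions reproduce the 42.6% and 18.5% figures reported in the abstract. The final comparison illustrates the general point behind the study's motivation: with a fixed detector, PPV falls as error prevalence falls.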

Related papers

A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
R$^{2}$Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection [45.33258156568741]
Foundation models for medical image segmentation struggle under out-of-distribution shifts. We introduce R$^{2}$Seg, a training-free framework for robust OOD tumor segmentation.
arXiv Detail & Related papers (2025-11-16T17:15:52Z)
Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports [12.808813933646407]
We introduce a sentence-level Process Reward Model (PRM) adapted for this vision-language task. The PRM predicts the factual correctness of each generated sentence conditioned on clinical context. PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5%.
arXiv Detail & Related papers (2025-10-27T11:08:05Z)
Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis [7.41395379449452]
This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a zero-shot medical Vision-Language Model (VLM). Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets.
arXiv Detail & Related papers (2025-10-01T01:46:09Z)
Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems [1.1373722549440357]
We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis CT reports. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores.
arXiv Detail & Related papers (2025-06-03T18:00:08Z)
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams. These evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an evaluation framework for assessing LLM performance on medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
Predicting Length of Stay in Neurological ICU Patients Using Classical Machine Learning and Neural Network Models: A Benchmark Study on MIMIC-IV [49.1574468325115]
This study explores multiple ML approaches for predicting length of stay (LOS) in the ICU, specifically for patients with neurological diseases, based on the MIMIC-IV dataset. The evaluated models include classic ML algorithms (K-Nearest Neighbors, Random Forest, XGBoost and CatBoost) and neural networks (LSTM, BERT and Temporal Fusion Transformer).
arXiv Detail & Related papers (2025-05-23T14:06:42Z)
ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z)
ThyroidEffi 1.0: A Cost-Effective System for High-Performance Multi-Class Thyroid Carcinoma Classification [0.0]
We develop and validate a deep learning system for multi-class thyroid FNAB image classification. Benign, Indeterminate/Suspicious, and Malignant are three key categories directly guiding post-biopsy treatment. The system processed 1000 cases in 30 seconds, demonstrating feasibility on widely accessible hardware.
arXiv Detail & Related papers (2025-04-19T02:13:07Z)
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [49.0793012627959]
We present VAPO, a novel framework tailored for reasoning models within the value-based paradigm. VAPO attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points.
arXiv Detail & Related papers (2025-04-07T14:21:11Z)
Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters [16.74673750576054]
Pulmonary embolism registries accelerate practice-improving research but rely on labor-intensive manual abstraction of radiology reports. We examined whether openly available large language models (LLMs) can automate concept extraction from computed tomography PE (CTPE) reports without loss of data quality. Four Llama 3 variants (3.0 8B, 3.1 8B, 3.1 70B, 3.3 70B) and one reviewer model, Phi 4 14B, were tested on 250 annotated CTPE reports from each of MIMIC IV and Duke University. Accuracy, positive predictive value (PPV) and negative predictive value (NPV) versus a human gold standard were measured across
arXiv Detail & Related papers (2025-03-26T21:38:06Z)
Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model [1.7064514726335305]
We analyzed 9,683 Hebrew radiology reports from Crohn's disease patients.<n>We incorporated uncertainty-aware prompt ensembles and an agent-based decision model.
arXiv Detail & Related papers (2025-02-02T16:57:03Z)
CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE). CRTRE applies randomized trial design principles to estimate the causal effect of association rules. We then incorporate such association rules into downstream applications such as the prediction of disease onset.
arXiv Detail & Related papers (2024-11-10T02:40:06Z)
Noisy probing dose facilitated dose prediction for pencil beam scanning proton therapy: physics enhances generalizability [18.852346492990637]
Prior AI-based dose prediction studies in photon and proton therapy often neglect underlying physics. Our aim is to design a physics-aware and generalizable AI-based PBSPT dose prediction method.
arXiv Detail & Related papers (2023-12-02T00:15:44Z)
Attention-based Saliency Maps Improve Interpretability of Pneumothorax Classification [52.77024349608834]
This study investigates the chest radiograph (CXR) classification performance of vision transformers (ViTs) and the interpretability of attention-based saliency. ViTs were fine-tuned for lung disease classification using four public data sets: CheXpert, Chest X-Ray 14, MIMIC CXR, and VinBigData. ViTs achieved CXR classification AUCs comparable to those of state-of-the-art CNNs.
arXiv Detail & Related papers (2023-03-03T12:05:41Z)
Controlling False Positive/Negative Rates for Deep-Learning-Based Prostate Cancer Detection on Multiparametric MR images [58.85481248101611]
We propose a novel PCa detection network that incorporates a lesion-level cost-sensitive loss and an additional slice-level loss based on a lesion-to-slice mapping function. Our experiments on 290 clinical patients conclude that 1) the lesion-level FNR was effectively reduced from 0.19 to 0.10 and the lesion-level FPR was reduced from 1.03 to 0.66 by changing the lesion-level cost.
arXiv Detail & Related papers (2021-06-04T09:51:27Z)
Deep Learning Based Detection and Localization of Intracranial Aneurysms in Computed Tomography Angiography [5.973882600944421]
A two-step model was implemented: a 3D region proposal network for initial aneurysm detection and 3D DenseNets for false-positive reduction. Our model showed statistically higher accuracy, sensitivity, and specificity when compared to the available model at 0.25 FPPV and the best F-1 score.
arXiv Detail & Related papers (2020-05-22T10:49:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.