Exploring Multimodal Large Language Models for Radiology Report Error-checking
- URL: http://arxiv.org/abs/2312.13103v2
- Date: Sun, 3 Mar 2024 21:06:33 GMT
- Title: Exploring Multimodal Large Language Models for Radiology Report Error-checking
- Authors: Jinge Wu, Yunsoo Kim, Eva C. Keller, Jamie Chow, Adam P. Levine,
Nikolas Pontikos, Zina Ibrahim, Paul Taylor, Michelle C. Williams, Honghan Wu
- Abstract summary: This paper proposes one of the first clinical applications of multimodal large language models (LLMs) as an assistant for radiologists to check errors in their reports.
We created an evaluation dataset from real-world radiology datasets (including X-rays and CT scans).
At the SIMPLE level, our fine-tuned model significantly enhanced performance by 47.4% and 25.4% on MIMIC-CXR and IU X-ray data, respectively.
- Score: 1.7217842380976978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes one of the first clinical applications of multimodal
large language models (LLMs) as an assistant for radiologists to check errors
in their reports. We created an evaluation dataset from real-world radiology
datasets (including X-rays and CT scans). A subset of original reports was
modified to contain synthetic errors by introducing three types of mistakes:
"insert", "remove", and "substitute". The evaluation contained two difficulty
levels: SIMPLE for binary error-checking and COMPLEX for identifying error
types. At the SIMPLE level, our fine-tuned model significantly enhanced
performance by 47.4% and 25.4% on MIMIC-CXR and IU X-ray data, respectively.
This performance boost was also observed on an unseen modality, CT scans, where
the model performed 19.46% better than the baseline model. The model also surpassed
the domain expert's accuracy on the MIMIC-CXR dataset by 1.67%. Notably, on the
subset (N=21) of the test set where a clinician did not reach the
correct conclusion, the LLaVA ensemble model correctly identified 71.4% of these
cases. However, all models performed poorly in identifying mistake types,
underscoring the difficulty of the COMPLEX level. This study marks a promising
step toward utilizing multimodal LLMs to enhance diagnostic accuracy in
radiology. The ensemble model demonstrated comparable performance to
clinicians, even capturing errors overlooked by humans.
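For intuition, here is a minimal sketch of how the three corruption types could be applied to a report to produce SIMPLE (binary) and COMPLEX (error-type) labels. The sentence pool, sampling choices, and function names are illustrative assumptions, not the authors' released pipeline.

```python
# Illustrative sketch of synthetic error injection with the three error
# types named in the abstract: "insert", "remove", "substitute".
import random

ERROR_TYPES = ["insert", "remove", "substitute"]

def inject_error(sentences, distractor_pool, rng=random):
    """Corrupt a report (a list of sentences) with one synthetic error."""
    error_type = rng.choice(ERROR_TYPES)
    corrupted = list(sentences)
    if error_type == "insert":
        # Add a finding that does not belong in this report.
        corrupted.insert(rng.randrange(len(corrupted) + 1),
                         rng.choice(distractor_pool))
    elif error_type == "remove":
        # Drop a genuine finding.
        corrupted.pop(rng.randrange(len(corrupted)))
    else:  # "substitute"
        # Swap a genuine finding for an unrelated one.
        corrupted[rng.randrange(len(corrupted))] = rng.choice(distractor_pool)
    return corrupted, error_type

report = ["Heart size is normal.", "No focal consolidation."]
distractors = ["There is a large right pleural effusion."]
corrupted, etype = inject_error(report, distractors)
# SIMPLE label: error present (True); COMPLEX label: etype
```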
Related papers
- ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports.
Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z)
- Generative Large Language Models Trained for Detecting Errors in Radiology Reports [11.852981889270012]
This dataset includes 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts.
Several models, including Llama-3, GPT-4, and BiomedBERT, were evaluated using zero-shot prompting, few-shot prompting, or fine-tuning strategies.
Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall.
arXiv Detail & Related papers (2025-04-06T03:02:36Z)
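Since the models above were compared under zero-shot, few-shot, and fine-tuned configurations, the following sketch shows how a few-shot error-checking prompt might be assembled. The instruction wording and exemplar reports are invented for illustration, not taken from the paper.

```python
# Hypothetical few-shot prompt assembly for binary report error-checking.
FEW_SHOT_EXAMPLES = [
    ("Findings: No pneumothorax. Impression: Large left pneumothorax.",
     "Error: yes (internal contradiction)"),
    ("Findings: Lungs are clear. Impression: No acute disease.",
     "Error: no"),
]

def build_prompt(report: str) -> str:
    parts = ["Decide whether the following radiology report contains an error."]
    for exemplar, verdict in FEW_SHOT_EXAMPLES:
        parts.append(f"Report: {exemplar}\n{verdict}")
    parts.append(f"Report: {report}\nError:")
    return "\n\n".join(parts)
```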
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.
Current error classification methods rely on static and predefined categories.
We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation [20.173287130474797]
Generative medical Vision Large Language Models (VLLMs) are prone to hallucinations and can produce inaccurate diagnostic information.
We introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties.
Our approach improves factuality scores by 10%, achieved by rejecting 20% of reports using the RaDialog model on the MIMIC-CXR dataset.
arXiv Detail & Related papers (2024-12-05T20:43:39Z)
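The rejection idea can be pictured with a toy sketch: sample several candidate reports per study, score how much they agree with one another, and abstain on the least consistent fraction. The token-overlap similarity below is a crude stand-in for the paper's semantic similarity measure, and all names are hypothetical.

```python
# Toy consistency-based abstention over sampled report generations.
from itertools import combinations

def token_jaccard(a: str, b: str) -> float:
    """Crude lexical proxy for semantic similarity between two reports."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency(samples):
    """Mean pairwise agreement among sampled reports (needs >= 2 samples)."""
    pairs = list(combinations(samples, 2))
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)

def keep_confident(studies, reject_fraction=0.20):
    """studies: dict mapping study_id -> list of sampled reports."""
    ranked = sorted(studies, key=lambda s: consistency(studies[s]))
    n_reject = int(len(ranked) * reject_fraction)
    return ranked[n_reject:]  # retain the most self-consistent studies
```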
- Anatomically-Grounded Fact Checking of Automated Chest X-ray Reports [0.0]
We propose a novel model for explainable fact-checking that identifies errors in findings and their locations as indicated in the reports.
We evaluate the resulting fact-checking model and its utility in correcting reports generated by several SOTA automated reporting tools.
arXiv Detail & Related papers (2024-12-03T05:21:42Z)
- Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease [0.7696359453385685]
This paper generates synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset.
Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset.
Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.
arXiv Detail & Related papers (2024-11-12T15:28:06Z)
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
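The pair-construction idea behind RISE can be illustrated as follows: take a correct solution, inject a predefined subtle error into part of it, and use the (correct, corrupted) pair as (chosen, rejected) for preference learning. The single operator-flip rule below is an invented placeholder, not the paper's actual error set.

```python
# Sketch of building a preference pair from an error-injected solution.
import re

def inject_subtle_error(solution: str) -> str:
    # Placeholder rule: flip the first "+" to "-", a typical subtle slip.
    return re.sub(r"\+", "-", solution, count=1)

def make_preference_pair(question: str, correct_solution: str) -> dict:
    return {
        "prompt": question,
        "chosen": correct_solution,                          # correct solution
        "rejected": inject_subtle_error(correct_solution),   # subtly wrong twin
    }

pair = make_preference_pair("What is 17 + 25?",
                            "17 + 25 = 42. The answer is 42.")
# pair["rejected"] == "17 - 25 = 42. The answer is 42."
```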
- ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports [1.9106067578277455]
We introduce ReXErr, a methodology that leverages Large Language Models to generate representative errors within chest X-ray reports.
We developed error categories that capture common mistakes in both human and AI-generated reports.
Our approach uses a novel sampling scheme to inject diverse errors while maintaining clinical plausibility.
arXiv Detail & Related papers (2024-09-17T01:42:39Z)
- CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images [3.0757789554622597]
This study aimed to develop an open-source multimodal large language model (CXR-LLAVA) for interpreting chest X-ray images (CXRs).
For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities.
The model's diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists.
arXiv Detail & Related papers (2023-10-22T06:22:37Z)
- ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
- An Evaluation of Machine Learning Approaches for Early Diagnosis of Autism Spectrum Disorder [0.0]
Autism Spectrum Disorder (ASD) is a neurological disease characterized by difficulties with social interaction, communication, and repetitive activities.
This study employs diverse machine learning methods to identify crucial ASD traits, aiming to enhance and automate the diagnostic process.
arXiv Detail & Related papers (2023-09-20T21:23:37Z)
- Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization [34.85353544844499]
We present the first dataset with fine-grained factual error annotations named DIASUMFACT.
We define fine-grained factual error detection as a sentence-level multi-label classification problem.
We propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models.
arXiv Detail & Related papers (2023-05-26T00:18:33Z)
- Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose a finite mixture regression (FMR) model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)