Exploring Multimodal Large Language Models for Radiology Report
Error-checking
- URL: http://arxiv.org/abs/2312.13103v2
- Date: Sun, 3 Mar 2024 21:06:33 GMT
- Title: Exploring Multimodal Large Language Models for Radiology Report
Error-checking
- Authors: Jinge Wu, Yunsoo Kim, Eva C. Keller, Jamie Chow, Adam P. Levine,
Nikolas Pontikos, Zina Ibrahim, Paul Taylor, Michelle C. Williams, Honghan Wu
- Abstract summary: This paper proposes one of the first clinical applications of multimodal large language models (LLMs) as an assistant for radiologists to check errors in their reports.
We created an evaluation dataset from real-world radiology datasets (including X-rays and CT scans).
At the SIMPLE level, our fine-tuned model significantly enhanced performance by 47.4% and 25.4% on MIMIC-CXR and IU X-ray data, respectively.
- Score: 1.7217842380976978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes one of the first clinical applications of multimodal
large language models (LLMs) as an assistant for radiologists to check errors
in their reports. We created an evaluation dataset from real-world radiology
datasets (including X-rays and CT scans). A subset of original reports was
modified to contain synthetic errors by introducing three types of mistakes:
"insert", "remove", and "substitute". The evaluation contained two difficulty
levels: SIMPLE for binary error-checking and COMPLEX for identifying error
types. At the SIMPLE level, our fine-tuned model significantly enhanced
performance by 47.4% and 25.4% on MIMIC-CXR and IU X-ray data, respectively.
This performance boost is also observed in an unseen modality, CT scans, as the
model performed 19.46% better than the baseline model. The model also surpassed
the domain expert's accuracy in the MIMIC-CXR dataset by 1.67%. Notably, among
the subsets (N=21) of the test set where a clinician did not achieve the
correct conclusion, the LLaVA ensemble model correctly identified 71.4% of these
cases. However, all models performed poorly in identifying mistake types,
underscoring the difficulty of the COMPLEX level. This study marks a promising
step toward utilizing multimodal LLMs to enhance diagnostic accuracy in
radiology. The ensemble model demonstrated comparable performance to
clinicians, even capturing errors overlooked by humans.
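The synthetic-error setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say how the errors were generated, so applying them at sentence level, drawing replacement sentences from a hypothetical `distractors` pool, and the names `inject_error`, `make_example`, and `error_rate` are all assumptions made for this example.

```python
import random

# The three mistake types named in the abstract.
ERROR_TYPES = ("insert", "remove", "substitute")


def inject_error(sentences, distractors, error_type, rng):
    """Return a copy of `sentences` with one synthetic error applied.

    Assumption: errors operate on whole sentences, and `distractors`
    holds sentences that do not appear in the original report.
    """
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")
    modified = list(sentences)  # never mutate the caller's report
    if error_type == "insert":
        # Add an unrelated finding at a random position.
        modified.insert(rng.randrange(len(modified) + 1), rng.choice(distractors))
    elif error_type == "remove":
        if len(modified) < 2:
            raise ValueError("need at least two sentences to remove one")
        # Drop one original finding.
        del modified[rng.randrange(len(modified))]
    else:  # "substitute"
        # Replace one original finding with an unrelated one.
        modified[rng.randrange(len(modified))] = rng.choice(distractors)
    return modified


def make_example(sentences, distractors, rng, error_rate=0.5):
    """Build one evaluation item carrying labels for both difficulty levels."""
    if rng.random() < error_rate:
        error_type = rng.choice(ERROR_TYPES)
        return {
            "report": inject_error(sentences, distractors, error_type, rng),
            "simple_label": "error",      # SIMPLE: binary error-checking
            "complex_label": error_type,  # COMPLEX: identify the error type
        }
    return {
        "report": list(sentences),
        "simple_label": "no_error",
        "complex_label": "none",
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    report = [
        "The lungs are clear.",
        "No pleural effusion.",
        "Heart size is normal.",
    ]
    distractors = ["There is a right pneumothorax."]
    print(make_example(report, distractors, rng))
```

Keeping both labels on every item lets the same dataset serve the SIMPLE level (compare `simple_label`) and the COMPLEX level (compare `complex_label`).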
Related papers
- CXR-LLAVA: a multimodal large language model for interpreting chest
X-ray images [3.0757789554622597]
This study aimed to develop an open-source multimodal large language model (CXR-LLAVA) for interpreting chest X-ray images (CXRs).
For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities.
The model's diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists.
arXiv Detail & Related papers (2023-10-22T06:22:37Z)
- ChatRadio-Valuer: A Chat Large Language Model for Generalizable
Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
- An Evaluation of Machine Learning Approaches for Early Diagnosis of
Autism Spectrum Disorder [0.0]
Autistic Spectrum Disorder (ASD) is a neurological disease characterized by difficulties with social interaction, communication, and repetitive activities.
This study employs diverse machine learning methods to identify crucial ASD traits, aiming to enhance and automate the diagnostic process.
arXiv Detail & Related papers (2023-09-20T21:23:37Z)
- Annotating and Detecting Fine-grained Factual Errors for Dialogue
Summarization [34.85353544844499]
We present the first dataset with fine-grained factual error annotations named DIASUMFACT.
We define fine-grained factual error detection as a sentence-level multi-label classification problem.
We propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models.
arXiv Detail & Related papers (2023-05-26T00:18:33Z)
- Improving Deep Facial Phenotyping for Ultra-rare Disorder Verification
Using Model Ensembles [52.77024349608834]
We analyze the influence of replacing a DCNN with a state-of-the-art face recognition approach, iResNet with ArcFace.
Our proposed ensemble model achieves state-of-the-art performance on both seen and unseen disorders.
arXiv Detail & Related papers (2022-11-12T23:28:54Z)
- Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for
Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
- Are Ensemble Classifiers Powerful Enough for the Detection and Diagnosis
of Intermediate-Severity Faults? [9.1591191545173]
Intermediate-Severity (IS) faults present milder symptoms compared to severe faults.
The lack of IS fault examples in the training data can pose severe risks to Fault Detection and Diagnosis (FDD) methods.
We discuss how to design more effective ensemble models for detecting and diagnosing IS faults.
arXiv Detail & Related papers (2020-07-07T02:05:04Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.