Exploring the Boundaries of GPT-4 in Radiology
- URL: http://arxiv.org/abs/2310.14573v1
- Date: Mon, 23 Oct 2023 05:13:03 GMT
- Title: Exploring the Boundaries of GPT-4 in Radiology
- Authors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C.
Castro, Maria Teodora Wetscherek, Robert Tinn, Harshita Sharma, Fernando
Pérez-García, Anton Schwaighofer, Pranav Rajpurkar, Sameer Tajdin Khanna,
Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya V. Nori, Matthew P. Lungren,
Ozan Oktay, Javier Alvarez-Valle
- Abstract summary: GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex contexts.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of general-domain large language models (LLMs) has
significantly changed the natural language processing paradigm towards a
unified foundation model across domains and applications. In this paper, we
focus on assessing the performance of GPT-4, the most capable LLM so far, on
text-based applications for radiology reports, comparing against
state-of-the-art (SOTA) radiology-specific models. Exploring various prompting
strategies, we evaluated GPT-4 on a diverse range of common radiology tasks and
we found GPT-4 either outperforms or is on par with current SOTA radiology
models. With zero-shot prompting, GPT-4 already obtains substantial gains
($\approx$ 10% absolute improvement) over radiology models in temporal sentence
similarity classification (accuracy) and natural language inference ($F_1$).
For tasks that require learning dataset-specific style or schema (e.g. findings
summarisation), GPT-4 improves with example-based prompting and matches
supervised SOTA. Our extensive error analysis with a board-certified
radiologist shows GPT-4 has a sufficient level of radiology knowledge with only
occasional errors in complex contexts that require nuanced domain knowledge. For
findings summarisation, GPT-4 outputs are found to be overall comparable with
existing manually-written impressions.
Related papers
- Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports [1.5972172622800358]
This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports.
Evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1%, on par with GPT-4.
arXiv Detail & Related papers (2024-10-11T20:16:25Z)
- BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports [9.739220217225435]
This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports.
We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it.
Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4's performance but also offers cost reductions and enhanced data privacy.
arXiv Detail & Related papers (2024-08-21T04:33:05Z)
- GPT-4V Cannot Generate Radiology Reports Yet [25.331936045860516]
GPT-4V's purported strong multimodal abilities raise interest in using it to automate radiology report writing.
We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics.
arXiv Detail & Related papers (2024-07-16T21:03:14Z)
- Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports [22.599250713630333]
Our proposed method combines the expertise of professional radiologists with Large Language Models (LLMs).
Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human- and AI-generated reports.
Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
arXiv Detail & Related papers (2024-01-29T21:24:43Z)
- Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z)
- ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset used in this study comprises a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
- Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning.
Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z)
- Radiology-GPT: A Large Language Model for Radiology [74.07944784968372]
We introduce Radiology-GPT, a large language model for radiology.
It demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA.
It exhibits significant versatility in radiological diagnosis, research, and communication.
arXiv Detail & Related papers (2023-06-14T17:57:24Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)
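Several of the papers above score model outputs with overlap metrics such as ROUGE and F1. A toy unigram-overlap (ROUGE-1 F1) sketch is shown below for illustration; real evaluations would use an established implementation such as the rouge-score package, with stemming and multi-reference handling this sketch omits.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a candidate "no acute disease" scored against the reference "no disease" shares two unigrams, giving precision 2/3, recall 1, and F1 0.8.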
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.