Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support
- URL: http://arxiv.org/abs/2501.02346v1
- Date: Sat, 04 Jan 2025 17:57:33 GMT
- Title: Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support
- Authors: Florian Putz, Marlen Haderlein, Sebastian Lettmaier, Sabine Semrau, Rainer Fietkau, Yixing Huang
- Abstract summary: An attempt to assess GPT-4's performance in radiation oncology was made via a dedicated 100-question examination.
GPT-4's performance on a broader field of clinical radiation oncology is benchmarked by the ACR Radiation Oncology In-Training (TXIT) exam.
Its performance on re-labelling structure names in accordance with the AAPM TG-263 report has also been benchmarked, achieving accuracy above 96%.
- Score: 1.592751576537053
- Abstract: Thanks to the rapidly evolving integration of LLMs into decision-support tools, a significant transformation is happening across large-scale systems. As in other medical fields, the use of LLMs such as GPT-4 is attracting increasing interest in radiation oncology. An attempt to assess GPT-4's performance in radiation oncology was made via a dedicated 100-question examination on the highly specialized topic of radiation oncology physics, revealing GPT-4's superiority over other LLMs. GPT-4's performance on the broader field of clinical radiation oncology is further benchmarked by the ACR Radiation Oncology In-Training (TXIT) exam, where GPT-4 achieved a high accuracy of 74.57%. Its performance on re-labelling structure names in accordance with the AAPM TG-263 report has also been benchmarked, achieving accuracy above 96%. Such studies shed light on the potential of LLMs in radiation oncology. While interest in the potential and constraints of LLMs in general healthcare applications continues to rise, the capabilities and limitations of LLMs in radiation oncology decision support have not yet been fully explored.
Related papers
- Can Modern LLMs Act as Agent Cores in Radiology Environments? [54.36730060680139]
Large language models (LLMs) offer enhanced accuracy and interpretability across various domains.
This paper aims to investigate the pre-requisite question for building concrete radiology agents.
First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents.
Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow.
arXiv Detail & Related papers (2024-12-12T18:20:16Z) - Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback [10.826651024680169]
Radiologists play a crucial role by translating medical images into medical reports.
While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy.
We propose a scalable automated preference alignment technique for VLMs in radiology, focusing on chest X-ray (CXR) report generation.
arXiv Detail & Related papers (2024-10-09T16:07:11Z) - BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports [9.739220217225435]
This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports.
We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it.
Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4's performance but also offers cost reductions and enhanced data privacy.
arXiv Detail & Related papers (2024-08-21T04:33:05Z) - MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama.
It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2.
Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - RadioRAG: Factual large language models for enhanced diagnostics in radiology using online retrieval augmented generation [1.7618750189510493]
Large language models (LLMs) often generate outdated or inaccurate information based on static training datasets.
We have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time.
arXiv Detail & Related papers (2024-07-22T13:29:56Z) - Exploring the Boundaries of GPT-4 in Radiology [46.30976153809968]
GPT-4 has a sufficient level of radiology knowledge, with only occasional errors in complex contexts.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
arXiv Detail & Related papers (2023-10-23T05:13:03Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset used in this study comprises a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4, among others.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning.
Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z) - Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology [7.094683738932199]
We evaluate the performance of ChatGPT-4 in radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases.
For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 63.65% and 74.57%, respectively.
ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry.
arXiv Detail & Related papers (2023-04-24T09:50:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.