Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam
and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted
Medical Education and Decision Making in Radiation Oncology
- URL: http://arxiv.org/abs/2304.11957v4
- Date: Mon, 21 Aug 2023 09:20:48 GMT
- Authors: Yixing Huang, Ahmed Gomaa, Sabine Semrau, Marlen Haderlein, Sebastian
Lettmaier, Thomas Weissmann, Johanna Grigo, Hassen Ben Tkhayat, Benjamin
Frey, Udo S. Gaipl, Luitpold V. Distel, Andreas Maier, Rainer Fietkau,
Christoph Bert, and Florian Putz
- Abstract summary: We evaluate the performance of ChatGPT-4 in radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases.
On the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 63.65% and 74.57%, respectively.
ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry.
- Score: 7.094683738932199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The potential of large language models in medicine for education and decision
making purposes has been demonstrated as they achieve decent scores on medical
exams such as the United States Medical Licensing Exam (USMLE) and the MedQA
exam. In this work, we evaluate the performance of ChatGPT-4 in the specialized
field of radiation oncology using the 38th American College of Radiology (ACR)
radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone
cases. On the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of
63.65% and 74.57%, respectively, highlighting the advantage of the latest
ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in
radiation oncology are identified to some extent. Specifically, ChatGPT-4
demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology,
and physics than knowledge of bone & soft tissue and gynecology, as per the ACR
knowledge domain. Regarding clinical care paths, ChatGPT-4 performs better in
diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry. It lacks
proficiency in the in-depth details of clinical trials. For the Gray Zone cases,
ChatGPT-4 is able to suggest a personalized treatment approach for each case
with high correctness and comprehensiveness. Importantly, it provides novel
treatment aspects for many cases that were not suggested by any of the human experts.
Both evaluations demonstrate the potential of ChatGPT-4 in medical education
for the general public and cancer patients, as well as the potential to aid
clinical decision-making, while acknowledging its limitations in certain
domains. Because of the risk of hallucination, facts provided by ChatGPT always
need to be verified.
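The abstract describes scoring ChatGPT-3.5 and ChatGPT-4 on the multiple-choice TXIT exam and breaking accuracy down by ACR knowledge domain and clinical care path. The short Python sketch below illustrates how such an evaluation could be scripted; it is not the authors' code, and the ask_model stub, the JSON exam format, and the file name are hypothetical placeholders.

# Minimal sketch: score a chat model on a multiple-choice exam and report
# per-domain accuracy. All names and the input format are illustrative only.
import json
import re
from collections import defaultdict

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real chat-model API call.
    return "A"

def extract_choice(answer_text: str) -> str:
    # Pull the first standalone answer letter (A-E) out of the model's reply.
    match = re.search(r"\b([A-E])\b", answer_text.upper())
    return match.group(1) if match else ""

def score_exam(path: str) -> None:
    # Assumed file format: a JSON list of
    # {"question": str, "choices": {"A": str, ...}, "answer": "A", "domain": str}
    with open(path) as f:
        questions = json.load(f)

    per_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        predicted = extract_choice(ask_model(prompt))
        stats = per_domain[q["domain"]]
        stats[1] += 1
        stats[0] += int(predicted == q["answer"])

    correct = sum(c for c, _ in per_domain.values())
    total = sum(n for _, n in per_domain.values())
    print(f"Overall accuracy: {100.0 * correct / total:.2f}%")
    for domain, (c, n) in sorted(per_domain.items()):
        print(f"{domain}: {100.0 * c / n:.2f}% ({c}/{n})")

if __name__ == "__main__":
    score_exam("txit_questions.json")  # hypothetical exam file

The per-domain tally is what supports statements such as "better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry": each question carries a domain label, and accuracy is aggregated per label as well as overall.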
Related papers
- Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support [1.592751576537053]
GPT-4's performance in radiation oncology was assessed via a dedicated 100-question examination.
Its performance on the broader field of clinical radiation oncology is benchmarked with the ACR Radiation Oncology In-Training (TXIT) exam.
Its performance on re-labelling structure names in accordance with the AAPM TG-263 report has also been benchmarked, achieving accuracies above 96%.
arXiv Detail & Related papers (2025-01-04T17:57:33Z)
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z)
- GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment [6.321623278767821]
GPT-4V was recently developed for general image interpretation.
Board-certified physicians and senior residents assessed GPT-4V's proficiency across a range of medical conditions.
GPT-4V's diagnostic accuracy and clinical decision-making abilities are poor, posing risks to patient safety.
arXiv Detail & Related papers (2023-11-14T17:06:09Z)
- Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels in understanding medical images and generates high-quality radiology reports.
It is found that its performance for medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V [20.84152508192388]
We critically evaluate the capabilities of the state-of-the-art multimodal large language model, GPT-4 with Vision (GPT-4V).
Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets.
The experiments with accuracy score conclude that the current version of GPT-4V is not recommended for real-world diagnostics.
arXiv Detail & Related papers (2023-10-29T16:26:28Z)
- Exploring the Boundaries of GPT-4 in Radiology [46.30976153809968]
GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex context.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
arXiv Detail & Related papers (2023-10-23T05:13:03Z)
- Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis [59.35504779947686]
GPT-4V is OpenAI's newest multimodal model; this work evaluates it for multimodal medical diagnosis.
Our evaluation encompasses 17 human body systems.
GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy.
It faces significant challenges in disease diagnosis and generating comprehensive reports.
arXiv Detail & Related papers (2023-10-15T18:32:27Z)
- ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
- Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics [9.167699167689369]
This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics.
We developed an exam consisting of 100 radiation oncology physics questions.
ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ were evaluated against medical physicists and non-experts.
arXiv Detail & Related papers (2023-04-01T06:04:58Z)
- Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)