Evaluating Large Language Models on a Highly-specialized Topic,
Radiation Oncology Physics
- URL: http://arxiv.org/abs/2304.01938v1
- Date: Sat, 1 Apr 2023 06:04:58 GMT
- Title: Evaluating Large Language Models on a Highly-specialized Topic,
Radiation Oncology Physics
- Authors: Jason Holmes, Zhengliang Liu, Lian Zhang, Yuzhen Ding, Terence T. Sio,
Lisa A. McGee, Jonathan B. Ashman, Xiang Li, Tianming Liu, Jiajian Shen, Wei
Liu
- Abstract summary: This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics.
We developed an exam consisting of 100 radiation oncology physics questions.
ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ were evaluated against medical physicists and non-experts.
- Score: 9.167699167689369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first study to investigate Large Language Models (LLMs) in
answering radiation oncology physics questions. Because popular exams like AP
Physics, LSAT, and GRE have large test-taker populations and ample test
preparation resources in circulation, they may not allow for accurately
assessing the true potential of LLMs. This paper proposes evaluating LLMs on a
highly-specialized topic, radiation oncology physics, which may be more
pertinent to scientific and medical communities in addition to being a valuable
benchmark of LLMs. We developed an exam consisting of 100 radiation oncology
physics questions based on our expertise at Mayo Clinic. Four LLMs, ChatGPT
(GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against
medical physicists and non-experts. ChatGPT (GPT-4) outperformed all other LLMs
as well as medical physicists, on average. The performance of ChatGPT (GPT-4)
was further improved when prompted to explain first, then answer. ChatGPT
(GPT-3.5 and GPT-4) showed a high level of consistency in their answer choices
across a number of trials, whether correct or incorrect, a characteristic that
was not observed in the human test groups. In evaluating ChatGPT's (GPT-4)
deductive reasoning ability using a novel approach (substituting the correct
answer with "None of the above choices is the correct answer."), ChatGPT
(GPT-4) demonstrated surprising accuracy, suggesting the potential presence of
an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall,
its intrinsic properties did not allow for further improvement when scoring
based on a majority vote across trials. In contrast, a team of medical
physicists was able to greatly outperform ChatGPT (GPT-4) using a majority
vote. This study suggests a great potential for LLMs to work alongside
radiation oncology experts as highly knowledgeable assistants.
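Two of the evaluation ideas described in the abstract, substituting the correct option with "None of the above choices is the correct answer." and scoring by majority vote across repeated trials, are simple to reproduce. Below is a minimal sketch (not the authors' code); the question dictionary layout and the toy answers are assumptions for illustration.
```python
# Minimal sketch of two evaluation ideas from the abstract (not the authors' code).
# The question format and the toy data below are assumptions.
from collections import Counter

def substitute_none_of_the_above(question):
    """Replace the correct option's text so that answering correctly
    requires ruling out every remaining (incorrect) option."""
    q = dict(question)
    options = list(q["options"])
    options[q["correct_index"]] = "None of the above choices is the correct answer."
    q["options"] = options
    return q  # the substituted option remains the keyed answer

def majority_vote(per_trial_answers):
    """Collapse answers from repeated trials into one answer per question
    by simple plurality (ties resolve to the first answer encountered)."""
    return [Counter(answers).most_common(1)[0][0]
            for answers in zip(*per_trial_answers)]

def score(answers, answer_key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, answer_key)) / len(answer_key)

# Toy example: three questions, three trials of the same exam.
answer_key = ["B", "D", "A"]
trials = [["B", "D", "C"],
          ["B", "C", "A"],
          ["B", "D", "C"]]
print(score(majority_vote(trials), answer_key))  # 0.666... (2 of 3 correct)
```
The same harness applies whether the per-trial answers come from an LLM or from human test takers, which is how the abstract frames the majority-vote comparison between ChatGPT (GPT-4) and the team of medical physicists.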
Related papers
- Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support [1.592751576537053]
An attempt to assess GPT-4's performance in radiation oncology was made via a dedicated 100-question examination.
GPT-4's performance in the broader field of clinical radiation oncology is benchmarked by the ACR Radiation Oncology In-Training (TXIT) exam.
Its performance on re-labelling structure names in accordance with the AAPM TG-263 report has also been benchmarked, achieving accuracies above 96%.
arXiv Detail & Related papers (2025-01-04T17:57:33Z) - The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams [9.802579169561781]
Large language models (LLMs) can generate medical qualification exam questions and corresponding answers based on few-shot prompts.
The study found that, with few-shot prompts, LLMs can effectively mimic real-world medical qualification exam questions.
arXiv Detail & Related papers (2024-10-31T09:33:37Z) - Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case
Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z) - Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT)
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - IvyGPT: InteractiVe Chinese pathwaY language model in medical domain [7.5386393444603454]
General large language models (LLMs) such as ChatGPT have shown remarkable success.
We propose IvyGPT, an LLM based on LLaMA that is trained and fine-tuned on high-quality medical question-answer pairs.
During training, QLoRA was used to fine-tune 33 billion parameters on a small number of NVIDIA A100 (80GB) GPUs.
Experimental results show that IvyGPT has outperformed other medical GPT models.
arXiv Detail & Related papers (2023-07-20T01:11:14Z) - Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam
and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted
Medical Education and Decision Making in Radiation Oncology [7.094683738932199]
We evaluate the performance of ChatGPT-4 in radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases.
For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 63.65% and 74.57%, respectively.
ChatGPT-4 performs better on diagnosis, prognosis, and toxicity questions than on brachytherapy and dosimetry questions.
arXiv Detail & Related papers (2023-04-24T09:50:39Z) - Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training, nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z) - Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)