The Potential and Pitfalls of using a Large Language Model such as
ChatGPT or GPT-4 as a Clinical Assistant
- URL: http://arxiv.org/abs/2307.08152v1
- Date: Sun, 16 Jul 2023 21:19:47 GMT
- Title: The Potential and Pitfalls of using a Large Language Model such as
ChatGPT or GPT-4 as a Clinical Assistant
- Authors: Jingqing Zhang, Kai Sun, Akshay Jagadeesh, Mahta Ghahfarokhi, Deepa
Gupta, Ashok Gupta, Vibhor Gupta, Yike Guo
- Abstract summary: ChatGPT and GPT-4 have demonstrated promising performance on several medical domain tasks.
We performed two analyses using ChatGPT and GPT-4: one to identify patients with specific medical diagnoses using a real-world large electronic health record database, and the other to provide diagnostic assistance in the prospective evaluation of hypothetical patients.
In patient assessment, GPT-4 diagnosed correctly three out of four times.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have demonstrated promising performance of ChatGPT
and GPT-4 on several medical domain tasks. However, none has assessed their
performance using a large-scale real-world electronic health record database,
nor evaluated their utility in providing clinical diagnostic assistance for
patients across a full range of disease presentations. We performed two
analyses using ChatGPT and GPT-4: one to identify patients with specific
medical diagnoses using a real-world large electronic health record database,
and the other to provide diagnostic assistance to healthcare workers in the
prospective evaluation of hypothetical patients. Our results show that, with
chain-of-thought and few-shot prompting, GPT-4 can achieve F1 scores as high
as 96% across disease classification tasks. In patient assessment, GPT-4
diagnosed correctly three out of four times. However, its responses included
factually incorrect statements, overlooked crucial medical findings, and
recommended unnecessary investigations and overtreatment. These issues,
coupled with privacy concerns, make these models currently inadequate for
real-world clinical use. However, the limited data and time needed for prompt
engineering, compared with configuring conventional machine learning
workflows, highlight their potential for scalability across healthcare
applications.
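The chain-of-thought and few-shot prompting combination mentioned in the abstract can be illustrated with a minimal sketch. The clinical notes, diagnosis, reasoning text, and function names below are hypothetical examples, not the paper's actual prompts or data:

```python
# Illustrative sketch of few-shot chain-of-thought prompt construction for a
# binary disease classification task. All example cases are invented.

FEW_SHOT_EXAMPLES = [
    {
        "note": "Patient reports polyuria, polydipsia, and an HbA1c of 9.1%.",
        "reasoning": "Polyuria and polydipsia with HbA1c >= 6.5% support diabetes mellitus.",
        "label": "yes",
    },
    {
        "note": "Patient presents with seasonal rhinitis; HbA1c 5.2%.",
        "reasoning": "HbA1c is in the normal range and the symptoms are unrelated.",
        "label": "no",
    },
]

def build_prompt(target_note: str, diagnosis: str = "type 2 diabetes") -> str:
    """Assemble a few-shot chain-of-thought prompt for one clinical note.

    Each worked example pairs a note with explicit reasoning and a label,
    and the prompt ends mid-pattern ("Reasoning:") so the model continues
    with step-by-step reasoning before its final answer.
    """
    parts = [
        f"Does the following note support a diagnosis of {diagnosis}? "
        "Think step by step, then answer yes or no."
    ]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Note: {ex['note']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['label']}"
        )
    parts.append(f"Note: {target_note}\nReasoning:")
    return "\n\n".join(parts)

prompt = build_prompt("Fasting glucose 142 mg/dL on two occasions; BMI 31.")
print(prompt)
```

The assembled string would then be sent to the model via whatever chat API is in use; only the prompt-construction pattern is sketched here.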
Related papers
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study evaluated the performance of Gemini, GPT-4, and four other popular large models in an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses [0.2995925627097048]
This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses.
GPT-4 demonstrates higher diagnostic accuracy owing to its deep and complete training history on medical data.
Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model.
arXiv Detail & Related papers (2024-05-09T15:12:24Z) - Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on
Prompt Engineering Strategies [28.98518677093905]
GPT-4V, OpenAI's latest large vision-language model, has piqued considerable interest for its potential in medical applications.
Recent studies and internal reviews highlight its underperformance in specialized medical tasks.
This paper explores the boundary of GPT-4V's capabilities in medicine, particularly in processing complex imaging data from endoscopies, CT scans, and MRIs.
arXiv Detail & Related papers (2023-12-07T15:05:59Z) - Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case
Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z) - Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z) - A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical
Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V is found to excel at understanding medical images and to generate high-quality radiology reports.
However, its performance on medical visual grounding needs substantial improvement.
arXiv Detail & Related papers (2023-10-31T11:39:09Z) - Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for
Multimodal Medical Diagnosis [59.35504779947686]
We evaluate GPT-4V, OpenAI's newest model, for multimodal medical diagnosis.
Our evaluation encompasses 17 human body systems.
GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy.
It faces significant challenges in disease diagnosis and generating comprehensive reports.
arXiv Detail & Related papers (2023-10-15T18:32:27Z) - The Case Records of ChatGPT: Language Models and Complex Clinical
Questions [0.35157846138914034]
The accuracy of the large language models GPT-4 and GPT-3.5 in diagnosing complex clinical cases was investigated.
GPT-4 and GPT-3.5 provided the correct diagnosis in 26% and 22% of cases on the first attempt, and in 46% and 42% of cases within three attempts, respectively.
arXiv Detail & Related papers (2023-05-09T16:58:32Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local
Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is neither specialized for medical problems through training nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)