Capabilities of GPT-4 on Medical Challenge Problems
- URL: http://arxiv.org/abs/2303.13375v2
- Date: Wed, 12 Apr 2023 16:48:39 GMT
- Title: Capabilities of GPT-4 on Medical Challenge Problems
- Authors: Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, Eric
Horvitz
- Abstract summary: GPT-4 is a general-purpose model that is not specialized for medical problems through training, nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
- Score: 23.399857819743158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in
natural language understanding and generation across various domains, including
medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art
LLM, on medical competency examinations and benchmark datasets. GPT-4 is a
general-purpose model that is not specialized for medical problems through
training or engineered to solve clinical tasks. Our analysis covers two sets of
official practice materials for the USMLE, a three-step examination program
used to assess clinical competency and grant licensure in the United States. We
also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond
measuring model performance, we conducted experiments to investigate the
influence of test questions containing both text and images, to probe for
memorization of content during training, and to study probability calibration,
which is of critical importance in high-stakes applications like medicine. Our
results show that GPT-4, without any
specialized prompt crafting, exceeds the passing score on USMLE by over 20
points and outperforms earlier general-purpose models (GPT-3.5) as well as
models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned
version of Flan-PaLM 540B). In addition, GPT-4 is significantly better
calibrated than GPT-3.5, demonstrating a much-improved ability to predict the
likelihood that its answers are correct. We also explore the behavior of the
model qualitatively through a case study that shows the ability of GPT-4 to
explain medical reasoning, personalize explanations to students, and
interactively craft new counterfactual scenarios around a medical case.
Implications of the findings are discussed for potential uses of GPT-4 in
medical education, assessment, and clinical practice, with appropriate
attention to challenges of accuracy and safety.
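The calibration finding can be made concrete with a small example. Below is a minimal sketch of expected calibration error (ECE), a standard way to quantify how well stated confidences track actual accuracy. It assumes per-question confidences and correctness flags are available; the binning details are illustrative, not the paper's exact protocol.

```python
# Minimal sketch of expected calibration error (ECE), assuming we have,
# for each question, the model's confidence in its chosen answer and
# whether that answer was correct. Binning choices are illustrative.
from typing import List

def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples in each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Include the right edge only in the last bin.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not in_bin:
            continue
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(acc - conf)
    return ece

# A well-calibrated model's confidences track its accuracy, so ECE is small.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [True, True, False, True]))
```

A perfectly calibrated model has an ECE of 0: within every confidence bin, accuracy equals average confidence.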
Related papers
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
We also present MedS-Ins, a companion instruction-tuning dataset, which comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple levels of perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities, all in a Visual Question Answering (VQA) format (a hypothetical record layout is sketched after this entry).
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
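To illustrate the VQA format mentioned in the entry above, here is a hypothetical sketch of what a single record in such a benchmark might look like. The field names and example values are illustrative assumptions, not GMAI-MMBench's actual schema.

```python
# Hypothetical VQA-style record for a medical multimodal benchmark.
# All field names and values are illustrative, not GMAI-MMBench's schema.
from dataclasses import dataclass
from typing import List

@dataclass
class VQASample:
    image_path: str      # path to the medical image
    modality: str        # e.g., "CT", "MRI", "X-ray"
    task: str            # e.g., "disease diagnosis"
    department: str      # e.g., "radiology"
    granularity: str     # e.g., image level vs. lesion level
    question: str
    options: List[str]   # multiple-choice candidates
    answer: str          # ground-truth option label

sample = VQASample(
    image_path="images/chest_ct_0001.png",
    modality="CT",
    task="disease diagnosis",
    department="radiology",
    granularity="image level",
    question="Which abnormality is most evident in this scan?",
    options=["A. Pneumothorax", "B. Pleural effusion", "C. No finding"],
    answer="B",
)
```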
- GPT-4 passes most of the 297 written Polish Board Certification Examinations [0.5461938536945723]
This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Państwowy Egzamin Specjalizacyjny, PES) dataset.
Performance varied significantly across the GPT models, which excelled in exams for certain specialties while failing others outright.
arXiv Detail & Related papers (2024-04-29T09:08:22Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation (a generic LLM-as-judge sketch follows this entry).
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
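The idea behind a GPT-4-based evaluation metric like CheXprompt can be sketched generically. The snippet below is a minimal LLM-as-judge factuality check using the OpenAI Python client; the prompt wording, 1-5 scale, and model choice are assumptions for illustration, and CheXprompt's actual protocol is defined in the paper.

```python
# Generic LLM-as-judge factuality check, loosely in the spirit of metrics
# like CheXprompt. The prompt wording, 1-5 scale, and model choice are
# illustrative assumptions, not the paper's actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_factuality(reference_report: str, candidate_report: str) -> str:
    prompt = (
        "You are evaluating a draft radiology report against a reference.\n"
        f"Reference report:\n{reference_report}\n\n"
        f"Candidate report:\n{candidate_report}\n\n"
        "List any factual errors in the candidate (findings added, omitted, "
        "or contradicted), then give a factuality score from 1 (poor) to 5 "
        "(fully consistent with the reference)."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```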
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite (the choice-shuffling ensembling component is sketched after this entry).
arXiv Detail & Related papers (2023-11-28T03:16:12Z)
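One component of Medprompt that is easy to convey in code is choice-shuffling ensembling: the model answers the same multiple-choice question several times with the answer options reordered, and the majority answer wins, tallied by option content rather than letter position. The sketch below assumes a hypothetical ask_model callable and simplifies the paper's full pipeline, which also includes kNN-selected few-shot examples and self-generated chain of thought.

```python
# Schematic choice-shuffling ensemble, one ingredient of Medprompt.
# `ask_model` is a hypothetical callable that takes a question and an
# ordered list of options and returns the index of the chosen option.
import random
from collections import Counter
from typing import Callable, List

def choice_shuffle_vote(ask_model: Callable[[str, List[str]], int],
                        question: str,
                        options: List[str],
                        n_rounds: int = 5,
                        seed: int = 0) -> str:
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_rounds):
        shuffled = options[:]
        rng.shuffle(shuffled)
        idx = ask_model(question, shuffled)
        # Vote by option text so shuffling cannot skew the tally.
        votes[shuffled[idx]] += 1
    return votes.most_common(1)[0][0]
```

Shuffling counteracts position bias: an option that wins only because of where it appears in the list will not win consistently across reorderings.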
- Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
We evaluate GPT-4V's multimodal capability for medical image analysis.
GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs substantial improvement.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V [20.84152508192388]
We critically evaluate the capabilities of the state-of-the-art multimodal large language model, GPT-4 with Vision (GPT-4V).
Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets.
Based on accuracy scores, the experiments conclude that the current version of GPT-4V is not recommended for real-world diagnostics.
arXiv Detail & Related papers (2023-10-29T16:26:28Z)
- The Potential and Pitfalls of using a Large Language Model such as ChatGPT or GPT-4 as a Clinical Assistant [12.017491902296836]
ChatGPT and GPT-4 have demonstrated promising performance on several medical domain tasks.
We performed two analyses using ChatGPT and GPT-4; the first identified patients with specific medical diagnoses in a large, real-world electronic health record database.
In patient assessment, GPT-4 produced an accurate diagnosis roughly three out of four times.
arXiv Detail & Related papers (2023-07-16T21:19:47Z)