Accuracy of a Vision-Language Model on Challenging Medical Cases
- URL: http://arxiv.org/abs/2311.05591v1
- Date: Thu, 9 Nov 2023 18:48:02 GMT
- Title: Accuracy of a Vision-Language Model on Challenging Medical Cases
- Authors: Thomas Buckley, James A. Diao, Adam Rodman, Arjun K. Manrai
- Abstract summary: General-purpose large language models that utilize both text and images have not been evaluated on a diverse array of challenging medical cases.
We evaluated the accuracy of the recently released Generative Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human respondents.
We also conducted a physician evaluation of GPT-4V on 69 NEJM clinicopathological conferences.
- Score: 1.7726473251723847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:
  Background: General-purpose large language models that utilize both text and images have not been evaluated on a diverse array of challenging medical cases.
  Methods: Using 934 cases from the NEJM Image Challenge published between 2005 and 2023, we evaluated the accuracy of the recently released Generative Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human respondents overall and stratified by question difficulty, image type, and skin tone. We further conducted a physician evaluation of GPT-4V on 69 NEJM clinicopathological conferences (CPCs). Analyses were conducted for models utilizing text alone, images alone, and both text and images.
  Results: GPT-4V achieved an overall accuracy of 61% (95% CI, 58 to 64%) compared to 49% (95% CI, 49 to 50%) for humans. GPT-4V outperformed humans at all levels of difficulty and disagreement, across skin tones, and across image types; the exception was radiographic images, where performance was equivalent between GPT-4V and human respondents. Longer, more informative captions were associated with improved performance for GPT-4V but similar performance for human respondents. GPT-4V included the correct diagnosis in its differential for 80% (95% CI, 68 to 88%) of CPCs when using text alone, compared to 58% (95% CI, 45 to 70%) of CPCs when using both images and text.
  Conclusions: GPT-4V outperformed human respondents on challenging medical cases and was able to synthesize information from both images and text, but performance deteriorated when images were added to highly informative text. Overall, our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy may depend heavily on context.
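The confidence intervals reported above can be reproduced with a standard binomial interval. Below is a minimal Python sketch, assuming the Wilson score interval (the abstract does not state which CI method the authors used) and an illustrative count of 570 of 934 cases correct, which matches the reported 61% (95% CI, 58 to 64%):

```python
# Minimal sketch (assumed, not from the paper): a Wilson score 95% CI
# for a binomial proportion, of the kind reported in the Results above.
from math import sqrt

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion at ~95% confidence."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

# Illustrative numbers: ~61% of the 934 NEJM Image Challenge cases correct.
lo, hi = wilson_ci(correct=570, total=934)
print(f"accuracy = {570/934:.0%}, 95% CI {lo:.0%} to {hi:.0%}")
# -> accuracy = 61%, 95% CI 58% to 64%
```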
Related papers
- An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging [0.3029213689620348]
  We explore the potential of the Gemini (gemini-1.0-pro-vision-latest) and GPT-4V models for medical image analysis.
  Both Gemini AI and GPT-4V are first used to classify real versus synthetic images, followed by an interpretation and analysis of the input images.
  Our early investigation provides insights into the potential of MLLMs to assist with the classification and interpretation of retinal fundoscopy and lung X-ray images.
  arXiv Detail & Related papers (2024-06-02T08:29:23Z)
- The Development and Performance of a Machine Learning Based Mobile Platform for Visually Determining the Etiology of Penile Pathology [0.0]
  We developed a machine-learning model for classifying five penile diseases.
  The model is currently in use globally and has the potential to improve access to diagnostic services for penile diseases.
  arXiv Detail & Related papers (2024-03-13T11:05:40Z)
- Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine [15.491432387608112]
  Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks.
  Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning.
  arXiv Detail & Related papers (2024-01-16T14:41:20Z)
- Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
  GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
  We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
  Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
  arXiv Detail & Related papers (2023-11-10T18:40:44Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
  GPT-4V's multimodal capability for medical image analysis is evaluated.
  GPT-4V excels at understanding medical images and generates high-quality radiology reports, but its performance on medical visual grounding needs substantial improvement.
  arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V [20.84152508192388]
  We critically evaluate the capabilities of the state-of-the-art multimodal large language model, GPT-4 with Vision (GPT-4V).
  Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets.
  The accuracy results indicate that the current version of GPT-4V is not recommended for real-world diagnostics.
  arXiv Detail & Related papers (2023-10-29T16:26:28Z)
- Improving image quality of sparse-view lung tumor CT images with U-Net [3.5655865803527718]
  We aimed to improve the image quality (IQ) of sparse-view computed tomography (CT) images using a U-Net for lung metastasis detection.
  Projection views can be reduced from 2,048 to 64 while maintaining IQ and radiologist confidence at a satisfactory level.
  arXiv Detail & Related papers (2023-07-28T12:03:55Z)
- COVID-Net USPro: An Open-Source Explainable Few-Shot Deep Prototypical Network to Monitor and Detect COVID-19 Infection from Point-of-Care Ultrasound Images [66.63200823918429]
  COVID-Net USPro monitors and detects COVID-19 positive cases with high precision and recall from minimal ultrasound images.
  The network achieves 99.65% overall accuracy, 99.7% recall, and 99.67% precision for COVID-19 positive cases when trained with only 5 shots.
  arXiv Detail & Related papers (2023-01-04T16:05:51Z)
- The Report on China-Spain Joint Clinical Testing for Rapid COVID-19 Risk Screening by Eye-region Manifestations [59.48245489413308]
  We developed and tested a COVID-19 rapid prescreening model using eye-region images captured in China and Spain with cellphone cameras.
  Performance was measured using the area under the receiver-operating-characteristic curve (AUC), sensitivity, specificity, accuracy, and F1 (a sketch of these metrics follows this list).
  arXiv Detail & Related papers (2021-09-18T02:28:01Z)
- A novel multiple instance learning framework for COVID-19 severity assessment via data augmentation and self-supervised learning [64.90342559393275]
  Assessing the severity of COVID-19 quickly and accurately is an essential problem while millions of people are suffering from the pandemic around the world.
  We observe two issues, weak annotation and insufficient data, that may obstruct automatic COVID-19 severity assessment with CT images.
  Our method obtains an average accuracy of 95.8%, with 93.6% sensitivity and 96.4% specificity, outperforming previous works.
  arXiv Detail & Related papers (2021-02-07T16:30:18Z)
- Integrative Analysis for COVID-19 Patient Outcome Prediction [53.11258640541513]
  We combine radiomics of lung opacities with non-imaging features from demographic data, vital signs, and laboratory findings to predict the need for intensive care unit admission.
  Our methods may also be applied to other lung diseases, including but not limited to community-acquired pneumonia.
  arXiv Detail & Related papers (2020-07-20T19:08:50Z)
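For reference, the evaluation metrics named in the eye-region screening entry above (AUC, sensitivity, specificity, accuracy, and F1) can be computed as in this minimal Python sketch; the labels and scores below are toy values for illustration, not data from any of the papers:

```python
# Minimal sketch (assumed): standard binary-classification metrics
# computed with scikit-learn from toy labels and scores.
from sklearn.metrics import roc_auc_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground truth (1 = positive)
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]   # model probabilities
y_pred = [int(s >= 0.5) for s in y_score]            # threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                          # a.k.a. recall
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"AUC         = {roc_auc_score(y_true, y_score):.2f}")
print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
print(f"accuracy    = {accuracy:.2f}")
print(f"F1          = {f1_score(y_true, y_pred):.2f}")
```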