A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical
Image Analysis
- URL: http://arxiv.org/abs/2310.20381v5
- Date: Tue, 30 Jan 2024 19:32:19 GMT
- Title: A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical
Image Analysis
- Authors: Yingshu Li, Yunyi Liu, Zhanyu Wang, Xinyu Liang, Lei Wang, Lingqiao
Liu, Leyang Cui, Zhaopeng Tu, Longyue Wang, Luping Zhou
- Abstract summary: GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V is found to excel in understanding medical images and to generate high-quality radiology reports.
However, its performance on medical visual grounding is found to need substantial improvement.
- Score: 87.25494411021066
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This work conducts an evaluation of GPT-4V's multimodal capability for
medical image analysis, with a focus on three representative tasks of radiology
report generation, medical visual question answering, and medical visual
grounding. For the evaluation, a set of prompts is designed for each task to
elicit the corresponding capability of GPT-4V and produce sufficiently good
outputs. Three evaluation approaches, namely quantitative analysis, human
evaluation, and case study, are employed to achieve an in-depth and extensive
evaluation. Our evaluation shows that GPT-4V excels in understanding medical
images and is able to generate high-quality radiology reports and effectively
answer questions about medical images. Meanwhile, its performance on medical
visual grounding is found to need substantial improvement. In addition, we
observe a discrepancy between the evaluation outcomes from quantitative
analysis and from human evaluation. This discrepancy suggests
the limitations of conventional metrics in assessing the performance of large
language models like GPT-4V and the necessity of developing new metrics for
automatic quantitative analysis.
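The limitation noted above concerns conventional n-gram overlap metrics commonly used to score generated radiology reports. The following is a minimal sketch, not the paper's actual evaluation pipeline, of how such metrics might be computed with the nltk and rouge-score libraries; the report strings are illustrative placeholders.

```python
# Minimal sketch (not the paper's pipeline): scoring a generated radiology
# report against a reference with conventional n-gram overlap metrics.
# Assumes nltk and rouge-score are installed; report texts are placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = ("The cardiomediastinal silhouette is within normal limits. "
             "No focal consolidation, pleural effusion, or pneumothorax.")
generated = ("Heart size is normal. No consolidation, effusion, or "
             "pneumothorax is seen.")

# BLEU: n-gram precision of the generated report against the reference.
bleu = sentence_bleu(
    [reference.lower().split()],
    generated.lower().split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap, widely reported for
# radiology report generation.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure

print(f"BLEU:    {bleu:.3f}")
print(f"ROUGE-L: {rouge_l:.3f}")
# Two clinically equivalent reports can still score low on surface-overlap
# metrics, which is the kind of gap human evaluation can expose.
```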
Related papers
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study evaluated the performance of Gemini, GPT-4, and 4 other popular large models in an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on
Prompt Engineering Strategies [28.98518677093905]
GPT-4V, OpenAI's latest large vision-language model, has piqued considerable interest for its potential in medical applications.
Recent studies and internal reviews highlight its underperformance in specialized medical tasks.
This paper explores the boundaries of GPT-4V's capabilities in medicine, particularly in processing complex imaging data from endoscopies, CT scans, and MRIs.
arXiv Detail & Related papers (2023-12-07T15:05:59Z) - Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Multimodal ChatGPT for Medical Applications: an Experimental Study of
GPT-4V [20.84152508192388]
We critically evaluate the capabilities of the state-of-the-art multimodal large language model, GPT-4 with Vision (GPT-4V).
Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets.
The accuracy-scored experiments conclude that the current version of GPT-4V is not recommended for real-world diagnostics.
arXiv Detail & Related papers (2023-10-29T16:26:28Z) - Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for
Multimodal Medical Diagnosis [59.35504779947686]
GPT-4V, OpenAI's newest large multimodal model, is evaluated here for multimodal medical diagnosis.
Our evaluation encompasses 17 human body systems.
GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy.
It faces significant challenges in disease diagnosis and generating comprehensive reports.
arXiv Detail & Related papers (2023-10-15T18:32:27Z) - The Potential and Pitfalls of using a Large Language Model such as
ChatGPT or GPT-4 as a Clinical Assistant [12.017491902296836]
ChatGPT and GPT-4 have demonstrated promising performance on several medical domain tasks.
We performed two analyses using ChatGPT and GPT-4: one to identify patients with specific medical diagnoses from a real-world large electronic health record database, and one for patient assessment.
For patient assessment, GPT-4 produced an accurate diagnosis three out of four times.
arXiv Detail & Related papers (2023-07-16T21:19:47Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z) - Predicting Patient Readmission Risk from Medical Text via Knowledge
Graph Enhanced Multiview Graph Convolution [67.72545656557858]
We propose a new method that uses medical text of Electronic Health Records for prediction.
We represent discharge summaries of patients with multiview graphs enhanced by an external knowledge graph.
Experimental results prove the effectiveness of our method, yielding state-of-the-art performance.
arXiv Detail & Related papers (2021-12-19T01:45:57Z)