Accuracy of a Vision-Language Model on Challenging Medical Cases
- URL: http://arxiv.org/abs/2311.05591v1
- Date: Thu, 9 Nov 2023 18:48:02 GMT
- Title: Accuracy of a Vision-Language Model on Challenging Medical Cases
- Authors: Thomas Buckley, James A. Diao, Adam Rodman, Arjun K. Manrai
- Abstract summary: General-purpose large language models that utilize both text and images have not been evaluated on a diverse array of challenging medical cases.
We evaluated the accuracy of the recently released Generative Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human respondents.
We also conducted a physician evaluation of GPT-4V on 69 NEJM clinicopathological conferences.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: General-purpose large language models that utilize both text and
images have not been evaluated on a diverse array of challenging medical cases.
Methods: Using 934 cases from the NEJM Image Challenge published between 2005
and 2023, we evaluated the accuracy of the recently released Generative
Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human
respondents overall and stratified by question difficulty, image type, and skin
tone. We further conducted a physician evaluation of GPT-4V on 69 NEJM
clinicopathological conferences (CPCs). Analyses were conducted for models
utilizing text alone, images alone, and both text and images.
Results: GPT-4V achieved an overall accuracy of 61% (95% CI, 58 to 64%)
compared to 49% (95% CI, 49 to 50%) for humans. GPT-4V outperformed humans at
all levels of difficulty and disagreement, skin tones, and image types; the
exception was radiographic images, where performance was equivalent between
GPT-4V and human respondents. Longer, more informative captions were associated
with improved performance for GPT-4V but similar performance for human
respondents. GPT-4V included the correct diagnosis in its differential for 80%
(95% CI, 68 to 88%) of CPCs when using text alone, compared to 58% (95% CI, 45
to 70%) of CPCs when using both images and text.
Conclusions: GPT-4V outperformed human respondents on challenging medical
cases and was able to synthesize information from both images and text, but
performance deteriorated when images were added to highly informative text.
Overall, our results suggest that multimodal AI models may be useful in medical
diagnostic reasoning but that their accuracy may depend heavily on context.
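The Results report proportions with 95% confidence intervals, overall and stratified by question difficulty, image type, and skin tone. As a minimal sketch of that kind of tally (not the authors' code: the Wilson interval, the field names, and the toy records below are assumptions for illustration), the snippet grades case-level outcomes and prints accuracy with a 95% CI overall and by image type.

```python
# Hypothetical scoring sketch: tally correct answers per case and report
# accuracy with Wilson 95% confidence intervals, overall and by image type.
# Field names and records are illustrative placeholders, not study data.
from collections import defaultdict
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% with z=1.96)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# One record per case: was the selected answer correct, and which image type
# the case used (placeholder values standing in for the graded outputs).
results = [
    {"correct": True, "image_type": "clinical photograph"},
    {"correct": False, "image_type": "radiograph"},
    {"correct": True, "image_type": "radiograph"},
]

k = sum(r["correct"] for r in results)
n = len(results)
lo, hi = wilson_ci(k, n)
print(f"overall: {k / n:.0%} (95% CI, {lo:.0%} to {hi:.0%})")

by_type: dict[str, list[bool]] = defaultdict(list)
for r in results:
    by_type[r["image_type"]].append(r["correct"])
for image_type, outcomes in sorted(by_type.items()):
    k, n = sum(outcomes), len(outcomes)
    lo, hi = wilson_ci(k, n)
    print(f"{image_type}: {k}/{n} correct (95% CI, {lo:.0%} to {hi:.0%})")
```

The same grouping step extends to difficulty level or skin tone, and the interval method can be swapped for whatever procedure the study actually used.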
Related papers
- MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification [19.29480118378639]
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels.
This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification.
arXiv Detail & Related papers (2025-02-11T09:42:13Z)
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning [12.10183458424711]
We present a novel medical image captioning method guided by the Segment Anything Model (SAM).
Our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images.
arXiv Detail & Related papers (2023-11-02T05:44:13Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types of medical vision-and-language pre-training models, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift [50.64474103506595]
We investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks.
Character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data.
arXiv Detail & Related papers (2022-12-15T18:52:03Z)
- RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [7.618389245539657]
We develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays.
We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts.
We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images.
arXiv Detail & Related papers (2022-11-23T06:58:09Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Discriminative Cross-Modal Data Augmentation for Medical Imaging Applications [24.06277026586584]
Deep learning methods have shown great success in medical image analysis, but they require large numbers of medical images for training.
Due to data privacy concerns and the unavailability of medical annotators, it is often difficult to obtain enough labeled medical images for model training.
We propose a discriminative unpaired image-to-image translation model which translates images in source modality into images in target modality.
arXiv Detail & Related papers (2020-10-07T15:07:00Z)