Accuracy of a Vision-Language Model on Challenging Medical Cases
- URL: http://arxiv.org/abs/2311.05591v1
- Date: Thu, 9 Nov 2023 18:48:02 GMT
- Title: Accuracy of a Vision-Language Model on Challenging Medical Cases
- Authors: Thomas Buckley, James A. Diao, Adam Rodman, Arjun K. Manrai
- Abstract summary: General-purpose large language models that utilize both text and images have not been evaluated on a diverse array of challenging medical cases.
We evaluated the accuracy of the recently released Generative Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human respondents.
We also conducted a physician evaluation of GPT-4V on 69 NEJM clinicopathological conferences.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: General-purpose large language models that utilize both text and
images have not been evaluated on a diverse array of challenging medical cases.
Methods: Using 934 cases from the NEJM Image Challenge published between 2005
and 2023, we evaluated the accuracy of the recently released Generative
Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human
respondents overall and stratified by question difficulty, image type, and skin
tone. We further conducted a physician evaluation of GPT-4V on 69 NEJM
clinicopathological conferences (CPCs). Analyses were conducted for models
utilizing text alone, images alone, and both text and images.
Results: GPT-4V achieved an overall accuracy of 61% (95% CI, 58 to 64%)
compared to 49% (95% CI, 49 to 50%) for humans. GPT-4V outperformed humans at
all levels of difficulty and disagreement, skin tones, and image types; the
exception was radiographic images, where performance was equivalent between
GPT-4V and human respondents. Longer, more informative captions were associated
with improved performance for GPT-4V but similar performance for human
respondents. GPT-4V included the correct diagnosis in its differential for 80%
(95% CI, 68 to 88%) of CPCs when using text alone, compared to 58% (95% CI, 45
to 70%) of CPCs when using both images and text.
Conclusions: GPT-4V outperformed human respondents on challenging medical
cases and was able to synthesize information from both images and text, but
performance deteriorated when images were added to highly informative text.
Overall, our results suggest that multimodal AI models may be useful in medical
diagnostic reasoning but that their accuracy may depend heavily on context.
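The Results report proportions with 95% confidence intervals, overall and stratified by question difficulty, image type, and skin tone. As a minimal sketch of that kind of tally (not the authors' code: the Wilson interval, the field names, and the toy records below are assumptions for illustration), the snippet grades case-level outcomes and prints accuracy with a 95% CI overall and by image type.

```python
# Hypothetical scoring sketch: tally correct answers per case and report
# accuracy with Wilson 95% confidence intervals, overall and by image type.
# Field names and records are illustrative placeholders, not study data.
from collections import defaultdict
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% with z=1.96)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# One record per case: was the selected answer correct, and which image type
# the case used (placeholder values standing in for the graded outputs).
results = [
    {"correct": True, "image_type": "clinical photograph"},
    {"correct": False, "image_type": "radiograph"},
    {"correct": True, "image_type": "radiograph"},
]

k = sum(r["correct"] for r in results)
n = len(results)
lo, hi = wilson_ci(k, n)
print(f"overall: {k / n:.0%} (95% CI, {lo:.0%} to {hi:.0%})")

by_type: dict[str, list[bool]] = defaultdict(list)
for r in results:
    by_type[r["image_type"]].append(r["correct"])
for image_type, outcomes in sorted(by_type.items()):
    k, n = sum(outcomes), len(outcomes)
    lo, hi = wilson_ci(k, n)
    print(f"{image_type}: {k}/{n} correct (95% CI, {lo:.0%} to {hi:.0%})")
```

The same grouping step extends to difficulty level or skin tone, and the interval method can be swapped for whatever procedure the study actually used.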
Related papers
- MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification [19.29480118378639]
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels.
This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification.
arXiv Detail & Related papers (2025-02-11T09:42:13Z)
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning [12.10183458424711]
We present a novel medical image captioning method guided by the Segment Anything Model (SAM).
Our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images.
arXiv Detail & Related papers (2023-11-02T05:44:13Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types of medical vision-and-language pre-training models, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift [50.64474103506595]
We investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks.
Character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data.
arXiv Detail & Related papers (2022-12-15T18:52:03Z)
- RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [7.618389245539657]
We develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays.
We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts.
We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images.
arXiv Detail & Related papers (2022-11-23T06:58:09Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Discriminative Cross-Modal Data Augmentation for Medical Imaging Applications [24.06277026586584]
Deep learning methods have shown great success in medical image analysis, but they require large numbers of medical images for training.
Due to data privacy concerns and the unavailability of medical annotators, it is often difficult to obtain enough labeled medical images for model training.
We propose a discriminative unpaired image-to-image translation model which translates images in source modality into images in target modality.
arXiv Detail & Related papers (2020-10-07T15:07:00Z)