Multimodal Foundation Models Exploit Text to Make Medical Image Predictions
- URL: http://arxiv.org/abs/2311.05591v2
- Date: Mon, 25 Nov 2024 15:38:17 GMT
- Title: Multimodal Foundation Models Exploit Text to Make Medical Image Predictions
- Authors: Thomas Buckley, James A. Diao, Pranav Rajpurkar, Adam Rodman, Arjun K. Manrai
- Abstract summary: We evaluate the mechanisms by which multimodal foundation models integrate and prioritize different data modalities, including images and text.
Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by their exploitation of text.
- Score: 3.4230952713864373
- License:
- Abstract: Multimodal foundation models have shown compelling but conflicting performance in medical image interpretation. However, the mechanisms by which these models integrate and prioritize different data modalities, including images and text, remain poorly understood. Here, using a diverse collection of 1014 multimodal medical cases, we evaluate the unimodal and multimodal image interpretation abilities of proprietary (GPT-4, Gemini Pro 1.0) and open-source (Llama-3.2-90B, LLaVA-Med-v1.5) multimodal foundation models with and without the use of text descriptions. Across all models, image predictions were largely driven by exploiting text, with accuracy increasing monotonically with the amount of informative text. By contrast, human performance on medical image interpretation did not improve with informative text. Exploitation of text is a double-edged sword; we show that even mild suggestions of an incorrect diagnosis in text diminish image-based classification, reducing performance dramatically in cases the model could previously answer with images alone. Finally, we conducted a physician evaluation of model performance on long-form medical cases, finding that the provision of images either reduced or had no effect on model performance when text was already highly informative. Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by their exploitation of text.
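The abstract above contrasts several prompting conditions: the image alone, the image plus informative clinical text, and the image plus a mildly misleading suggestion. A minimal sketch of that comparison using the OpenAI Python SDK is shown below; the model name, prompts, and file path are illustrative placeholders, not the paper's exact protocol.

```python
# Hedged sketch of the image-only vs. image+text comparison described in the abstract.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(image_path, clinical_text=None):
    """Query a multimodal chat model with one medical image and optional text context."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = "What is the most likely diagnosis? Answer with a single diagnosis."
    if clinical_text:
        prompt = f"Clinical context: {clinical_text}\n\n{prompt}"
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

case_image = "case_001.png"  # placeholder path
print(ask(case_image))                                                      # image alone
print(ask(case_image, "65-year-old with fever and productive cough."))     # image + informative text
print(ask(case_image, "The referring team suspects pulmonary embolism."))  # mildly misleading suggestion
```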
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
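A toy sketch of the triplet-style alignment idea summarized in the LoGra-Med entry above: matched image, conversation-description, and extended-caption embeddings are pulled together with a symmetric contrastive loss over each modality pair. This is an interpretation of the one-line summary, not the paper's actual multi-graph algorithm.

```python
import torch
import torch.nn.functional as F

def pair_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def triplet_alignment_loss(img, conv, cap):
    """Align all three modality pairs so matched triplets stay consistent."""
    return pair_nce(img, conv) + pair_nce(img, cap) + pair_nce(conv, cap)

# Toy usage with random features standing in for encoder outputs.
B, D = 8, 256
loss = triplet_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```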
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
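The Brain Abnormalities entry above mentions mapping subfigures to subcaptions. A small, hypothetical illustration of the caption side of that step: compound PubMed-style captions of the form "(A) ... (B) ..." can be split into labeled subcaptions that are then paired with cropped subfigures.

```python
import re

def split_subcaptions(caption: str) -> dict:
    """Split a caption like '(A) T1 MRI. (B) T2 MRI.' into {'A': ..., 'B': ...}."""
    parts = re.split(r"\(([A-Z])\)\s*", caption)
    # re.split keeps the captured labels: ['', 'A', 'text...', 'B', 'text...']
    labels, texts = parts[1::2], parts[2::2]
    return {lab: txt.strip() for lab, txt in zip(labels, texts)}

caption = "(A) Axial T1-weighted MRI showing a left frontal mass. (B) T2-weighted image of the same lesion."
for label, text in split_subcaptions(caption).items():
    print(label, "->", text)
```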
- Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning [12.10183458424711]
We present a novel medical image captioning method guided by the segment anything model (SAM).
Our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images.
arXiv Detail & Related papers (2023-11-02T05:44:13Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through sufficient experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
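A minimal sketch of an image-text-label contrastive objective in the spirit of the Unified Medical Contrastive Learning entry above: image-text pairs count as positives whenever they share a diagnostic label, not only when they come from the same case. The continuous-prompt component is omitted, and this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive(img_emb, txt_emb, labels, tau: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                # (B, B) image-to-text similarities
    pos = (labels[:, None] == labels[None, :]).float()  # 1 where diagnostic labels match
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-likelihood over all label-matched text samples for each image
    return -(pos * log_prob).sum(dim=1).div(pos.sum(dim=1)).mean()

B, D = 16, 128
loss = label_aware_contrastive(torch.randn(B, D), torch.randn(B, D),
                               torch.randint(0, 4, (B,)))
print(loss.item())
```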
- Medical diffusion on a budget: Textual Inversion for medical image generation [3.0826983115939823]
Training from scratch requires large captioned datasets and significant computational resources.
This work shows that adapting pre-trained Stable Diffusion models to medical imaging modalities is achievable by training text embeddings.
The trained embeddings are compact (less than 1 MB), enabling easy data sharing with reduced privacy concerns.
arXiv Detail & Related papers (2023-03-23T16:50:19Z)
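For the Textual Inversion entry above, the practical upshot is that a learned embedding file smaller than 1 MB defines a new pseudo-token for a frozen Stable Diffusion pipeline. A hedged usage sketch with the diffusers library follows; the repository name, embedding file, token, and prompt are placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The learned embedding file (< 1 MB) defines a new pseudo-token, e.g. "<chest-ct>".
pipe.load_textual_inversion("learned_embeds.bin", token="<chest-ct>")

image = pipe("a <chest-ct> slice with a small pleural effusion",
             num_inference_steps=30).images[0]
image.save("synthetic_ct.png")
```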
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
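One possible reading of the prompt-as-feature-bank idea in the PTUnifier entry above, sketched below: a pool of learnable visual prompt vectors is retrieved by similarity to the text features and stands in for image tokens when a sample is text-only. This is speculative and simplified, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptBank(nn.Module):
    def __init__(self, pool_size: int = 64, num_prompts: int = 8, dim: int = 256):
        super().__init__()
        self.pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        self.num_prompts = num_prompts

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        """Retrieve the prompts most similar to the text query: (B, dim) -> (B, k, dim)."""
        scores = text_feat @ self.pool.t()                   # (B, pool_size)
        idx = scores.topk(self.num_prompts, dim=-1).indices  # (B, k)
        return self.pool[idx]                                # acts as pseudo image tokens

bank = PromptBank()
pseudo_visual_tokens = bank(torch.randn(4, 256))
print(pseudo_visual_tokens.shape)  # torch.Size([4, 8, 256])
```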
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
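A rough sketch of the multi-image encoding described in the BioViL-T entry above: a CNN encodes the current and prior study, a small Transformer fuses the two as a length-two sequence, and the fused vector can then be aligned with a report embedding. The backbone choice and dimensions are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TemporalImageEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        cnn = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # global-pooled CNN features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, current: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        feats = [self.backbone(x).flatten(1) for x in (current, prior)]  # 2 x (B, 512)
        seq = torch.stack(feats, dim=1)      # (B, 2, 512): current + prior as tokens
        return self.fuser(seq).mean(dim=1)   # (B, 512) fused temporal representation

enc = TemporalImageEncoder()
out = enc(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 512])
```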
- Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift [50.64474103506595]
We investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks.
Character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data.
arXiv Detail & Related papers (2022-12-15T18:52:03Z)
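The robustness benchmark above singles out character-level noise for text and zoom blur for images as the most damaging shifts. The toy functions below illustrate both corruptions; the benchmark's actual severity levels and implementations are not reproduced.

```python
import random
from PIL import Image

def char_perturb(text: str, p: float = 0.1, seed: int = 0) -> str:
    """Randomly delete or substitute characters with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p / 2:
            continue                                       # deletion
        if r < p:
            ch = rng.choice("abcdefghijklmnopqrstuvwxyz")  # substitution
        out.append(ch)
    return "".join(out)

def zoom_blur(img: Image.Image, max_zoom: float = 1.2, steps: int = 5) -> Image.Image:
    """Average progressively zoomed-in crops to approximate zoom blur."""
    w, h = img.size
    acc = None
    for i in range(steps):
        z = 1 + (max_zoom - 1) * i / (steps - 1)
        box = (int(w / 2 - w / (2 * z)), int(h / 2 - h / (2 * z)),
               int(w / 2 + w / (2 * z)), int(h / 2 + h / (2 * z)))
        layer = img.crop(box).resize((w, h))
        acc = layer if acc is None else Image.blend(acc, layer, 1 / (i + 1))
    return acc

print(char_perturb("no acute cardiopulmonary abnormality"))
blurred = zoom_blur(Image.new("RGB", (224, 224), "gray"))
```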
- RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [7.618389245539657]
We develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays.
We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts.
We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images.
arXiv Detail & Related papers (2022-11-23T06:58:09Z)
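A self-contained toy version of the adaptation objective implied by the RoentGen entry above: a text-conditioned denoiser is fine-tuned on chest X-ray latents with the standard noise-prediction loss. The tiny MLP and random tensors stand in for the real U-Net, VAE latents, and report embeddings purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, z_t, t, text_emb):
        t = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([z_t, text_emb, t], dim=-1))

denoiser = TinyDenoiser()
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
alphas_bar = torch.linspace(0.999, 0.01, 1000)       # crude noise schedule

for step in range(100):                              # stand-in for CXR/report batches
    z0 = torch.randn(8, 64)                          # latent of a chest X-ray
    text = torch.randn(8, 32)                        # embedding of the report/prompt
    t = torch.randint(0, 1000, (8,))
    eps = torch.randn_like(z0)
    a = alphas_bar[t].unsqueeze(-1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps       # forward diffusion
    loss = F.mse_loss(denoiser(z_t, t, text), eps)   # predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```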
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
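A hypothetical sketch of the cross-reconstruction idea in the co-attention entry above: each view is decoded from the other view's latent code, and a shared classification head injects label guidance. The adversarial component mentioned in the summary is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoViewModel(nn.Module):
    def __init__(self, d1=100, d2=80, latent=32, num_classes=3):
        super().__init__()
        self.enc1, self.enc2 = nn.Linear(d1, latent), nn.Linear(d2, latent)
        self.dec1, self.dec2 = nn.Linear(latent, d1), nn.Linear(latent, d2)
        self.clf = nn.Linear(latent, num_classes)

    def forward(self, x1, x2, y):
        z1, z2 = self.enc1(x1), self.enc2(x2)
        # reconstruct each view from the *other* view's latent code
        cross_recon = F.mse_loss(self.dec1(z2), x1) + F.mse_loss(self.dec2(z1), x2)
        # label guidance on both latent codes
        label_loss = F.cross_entropy(self.clf(z1), y) + F.cross_entropy(self.clf(z2), y)
        return cross_recon + label_loss

model = TwoViewModel()
loss = model(torch.randn(16, 100), torch.randn(16, 80), torch.randint(0, 3, (16,)))
print(loss.item())
```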
- Discriminative Cross-Modal Data Augmentation for Medical Imaging Applications [24.06277026586584]
While deep learning methods have shown great success in medical image analysis, they require large numbers of medical images for training.
Due to data privacy concerns and the unavailability of medical annotators, it is often very difficult to obtain enough labeled medical images for model training.
We propose a discriminative unpaired image-to-image translation model which translates images in source modality into images in target modality.
arXiv Detail & Related papers (2020-10-07T15:07:00Z)
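Finally, a loose sketch of the discriminative unpaired translation described in the entry above: a generator maps source-modality images into the target modality, a discriminator enforces realism, and a frozen task classifier encourages the translated image to keep its diagnostic label so it can serve as augmentation. Networks and shapes are toy stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
D = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Flatten(), nn.Linear(16 * 16 * 16, 1))
C = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 2))   # frozen label classifier (toy)
for p in C.parameters():
    p.requires_grad_(False)

src = torch.randn(4, 1, 32, 32)                  # source-modality images
labels = torch.randint(0, 2, (4,))               # their diagnostic labels
fake_tgt = G(src)                                # translated into the target modality
adv_loss = F.binary_cross_entropy_with_logits(D(fake_tgt), torch.ones(4, 1))
cls_loss = F.cross_entropy(C(fake_tgt), labels)  # translation must preserve the label
gen_loss = adv_loss + cls_loss
gen_loss.backward()
print(adv_loss.item(), cls_loss.item())
```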
This list is automatically generated from the titles and abstracts of the papers on this site.