Does CLIP Benefit Visual Question Answering in the Medical Domain as
Much as it Does in the General Domain?
- URL: http://arxiv.org/abs/2112.13906v1
- Date: Mon, 27 Dec 2021 21:19:23 GMT
- Title: Does CLIP Benefit Visual Question Answering in the Medical Domain as
Much as it Does in the General Domain?
- Authors: Sedigheh Eslami, Gerard de Melo, Christoph Meinel
- Abstract summary: This work evaluates the effectiveness of Contrastive Language--Image Pre-training (CLIP) for the task of Medical Visual Question Answering (MedVQA).
Our experiments are conducted on two MedVQA benchmark datasets and investigate two MedVQA methods, MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via Conditional Reasoning).
For each of these, we assess the merits of visual representation learning using PubMedCLIP, the original CLIP, and state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only on visual data.
- Score: 38.229972218195336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language--Image Pre-training (CLIP) has shown remarkable success
in learning with cross-modal supervision from extensive amounts of image--text
pairs collected online. Thus far, the effectiveness of CLIP has been
investigated primarily in general-domain multimodal problems. This work
evaluates the effectiveness of CLIP for the task of Medical Visual Question
Answering (MedVQA). To this end, we present PubMedCLIP, a fine-tuned version of
CLIP for the medical domain based on PubMed articles. Our experiments are
conducted on two MedVQA benchmark datasets and investigate two MedVQA methods,
MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via
Conditional Reasoning). For each of these, we assess the merits of visual
representation learning using PubMedCLIP, the original CLIP, and
state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only
on visual data. We open-source the code for our MedVQA pipeline and for
pre-training PubMedCLIP. Both CLIP and PubMedCLIP improve upon MAML's visual
encoder, with PubMedCLIP achieving the best results and gains of up to 3% in
overall accuracy. Individual examples illustrate the strengths
of PubMedCLIP in comparison to the previously widely used MAML networks. Visual
representation learning with language supervision in PubMedCLIP leads to
noticeable improvements for MedVQA. Our experiments also reveal distributional
differences between the two MedVQA benchmark datasets that have not been
reported in previous work and that cause the different back-end visual encoders
in PubMedCLIP to behave differently on these datasets. Moreover, we observe
fundamental performance differences between VQA in the general domain and in
the medical domain.
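To make the two stages described in the abstract concrete, the sketch below shows (1) continuing CLIP's contrastive image--text training on medical image--caption pairs, which is the PubMedCLIP-style fine-tuning step, and (2) reusing the fine-tuned visual encoder as the image backbone of a MedVQA answer classifier. It uses the Hugging Face transformers CLIP implementation as a stand-in; the dataset format, hyperparameters, the MedVQAClassifier head, and the answer-vocabulary size are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch, assuming (PIL image, caption) pairs extracted from PubMed
# articles; not the authors' pipeline.
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def collate(batch):
    # batch: list of (PIL image, caption string) pairs from a medical corpus.
    images, captions = zip(*batch)
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True, truncation=True)

def finetune(pairs_dataset, epochs=3, lr=1e-5):
    """Stage 1: continue CLIP's contrastive pre-training on medical pairs."""
    loader = DataLoader(pairs_dataset, batch_size=32, shuffle=True,
                        collate_fn=collate)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # return_loss=True makes CLIPModel compute the symmetric
            # image-to-text / text-to-image contrastive loss.
            loss = model(**batch, return_loss=True).loss
            optim.zero_grad()
            loss.backward()
            optim.step()

class MedVQAClassifier(torch.nn.Module):
    """Stage 2 (toy): fuse CLIP image features with a question embedding and
    classify over an answer vocabulary. question_dim and num_answers are
    placeholders for whatever question encoder and dataset are used."""
    def __init__(self, clip_model, question_dim=1024, num_answers=1000):
        super().__init__()
        self.visual = clip_model.vision_model        # fine-tuned visual encoder
        self.proj = clip_model.visual_projection
        fused = clip_model.config.projection_dim + question_dim
        self.head = torch.nn.Sequential(
            torch.nn.Linear(fused, 1024), torch.nn.ReLU(),
            torch.nn.Linear(1024, num_answers))

    def forward(self, pixel_values, question_emb):
        img = self.proj(self.visual(pixel_values=pixel_values).pooler_output)
        return self.head(torch.cat([img, question_emb], dim=-1))
```

In the MEVF and QCR settings evaluated in the paper, this visual-encoder slot is where PubMedCLIP, the original CLIP, or the MAML network is swapped in, which is what the reported comparisons measure.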
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show that LoGra-Med matches LLaVA-Med's performance when trained on 600K image-text pairs for medical VQA and significantly outperforms it when trained on only 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging [68.6715007665896]
FedMedICL is a unified framework and benchmark to holistically evaluate federated medical imaging challenges.
We comprehensively evaluate several popular methods on six diverse medical imaging datasets.
We find that a simple batch balancing technique surpasses advanced methods in average performance across FedMedICL experiments.
arXiv Detail & Related papers (2024-07-11T19:12:23Z)
- HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale [29.956053068653734]
We create the PubMedVision dataset with 1.3 million medical VQA samples.
Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios.
arXiv Detail & Related papers (2024-06-27T15:50:41Z)
- MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore MAEs for zero-shot learning across domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z)
- CLIP in Medical Imaging: A Comprehensive Survey [59.429714742927956]
Contrastive Language-Image Pre-training successfully introduces text supervision to vision models.
It has shown promising results across various tasks, attributable to its generalizability and interpretability.
CLIP has recently attracted increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents [35.64805788623848]
We build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral's OpenAccess subset.
PMC-OA covers diverse modalities and diseases, with the majority of the image-caption samples aligned at a finer-grained level.
While pretraining a CLIP-style model on PMC-OA, our model named PMC-CLIP achieves state-of-the-art results on various downstream tasks.
arXiv Detail & Related papers (2023-03-13T16:13:16Z)
- Understanding the Tricks of Deep Learning in Medical Image Segmentation: Challenges and Future Directions [66.40971096248946]
In this paper, we collect a series of MedISeg tricks for different model implementation phases.
We experimentally explore the effectiveness of these tricks on consistent baselines.
We also open-source a strong MedISeg repository in which each component is plug-and-play.
arXiv Detail & Related papers (2022-09-21T12:30:05Z)