UnICLAM: Contrastive Representation Learning with Adversarial Masking for
Unified and Interpretable Medical Vision Question Answering
- URL: http://arxiv.org/abs/2212.10729v3
- Date: Wed, 27 Sep 2023 11:30:06 GMT
- Title: UnICLAM: Contrastive Representation Learning with Adversarial Masking for
Unified and Interpretable Medical Vision Question Answering
- Authors: Chenlu Zhan, Peng Peng, Hongsen Wang, Tao Chen, Hongwei Wang
- Abstract summary: Current Medical-VQA models learn cross-modal representations by placing the vision and text encoders in two separate spaces, which leads to indirect semantic alignment.
We propose UnICLAM, a Unified and Interpretable Medical-VQA model through Contrastive Representation Learning with Adversarial Masking.
Experimental results on the VQA-RAD and SLAKE public benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models.
- Score: 7.2486693553383805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (Medical-VQA) aims to answer clinical
questions regarding radiology images, assisting doctors with decision-making
options. Nevertheless, current Medical-VQA models learn cross-modal
representations by placing the vision and text encoders in two separate
spaces, which leads to indirect semantic alignment. In this paper, we propose
UnICLAM, a Unified and Interpretable Medical-VQA model through Contrastive
Representation Learning with Adversarial Masking. Specifically, to learn an
aligned image-text representation, we first establish a unified dual-stream
pre-training structure with the gradually soft-parameter sharing strategy.
Technically, the proposed strategy constrains the vision and text encoders to
lie close in a shared space, and this constraint is gradually loosened in
higher layers. Moreover, to capture a unified semantic
representation, we extend the adversarial masking data augmentation to the
contrastive representation learning of vision and text in a unified manner.
Concretely, while the encoder training minimizes the distance between the
original and masked samples, the adversarial masking module is trained
adversarially to maximize that distance. Furthermore, we further explore the
unified adversarial masking augmentation model, which improves potential
ante-hoc interpretability while retaining strong performance and efficiency.
Experimental results on the VQA-RAD and SLAKE public benchmarks demonstrate
that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models. More
importantly, we additionally discuss the performance of UnICLAM in diagnosing
heart failure, verifying that UnICLAM exhibits superior few-shot adaptation
performance in practical disease diagnosis.
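The abstract describes a gradually soft parameter-sharing strategy that pulls the vision and text encoders toward a shared space, with the constraint loosened in higher layers. The snippet below is a minimal, hypothetical PyTorch-style sketch of such a penalty; the function name, the per-layer exponential decay schedule, and the assumption that corresponding vision/text layers have identically shaped parameters are illustrative choices, not the authors' implementation.

```python
import torch.nn.functional as F

def soft_sharing_penalty(vision_layers, text_layers, base_weight=1.0, decay=0.5):
    """Gradually soft parameter sharing (illustrative sketch).

    Pulls parameters of corresponding vision/text layers toward each other,
    with a weight that shrinks in deeper layers so the constraint is
    gradually loosened, as described in the abstract.
    """
    penalty = 0.0
    for depth, (v_layer, t_layer) in enumerate(zip(vision_layers, text_layers)):
        weight = base_weight * (decay ** depth)  # looser constraint at higher layers
        for v_param, t_param in zip(v_layer.parameters(), t_layer.parameters()):
            penalty = penalty + weight * F.mse_loss(v_param, t_param)
    return penalty
```

Such a penalty would simply be added to the pre-training loss; the abstract does not specify the exact schedule or weighting the authors use.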
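The abstract also describes a min-max game: the encoders are trained to minimize the distance between the representations of an original sample and its masked counterpart, while an adversarial masking module is trained to maximize that same distance. The sketch below is a simplified, hypothetical rendering of one alternating update; the `mask_module`, the cosine-distance objective, and the two-optimizer schedule are assumptions for illustration only, not the paper's released code.

```python
import torch.nn.functional as F

def masked_distance(encoder, x, mask):
    """Distance between representations of the original and masked inputs
    (1 - cosine similarity); both sides of the min-max game use this value."""
    z_orig = F.normalize(encoder(x), dim=-1)
    z_mask = F.normalize(encoder(x * mask), dim=-1)
    return (1.0 - (z_orig * z_mask).sum(dim=-1)).mean()

def adversarial_masking_step(encoder, mask_module, x, enc_opt, mask_opt):
    # 1) Adversarial masking module: maximize the distance
    #    (gradient ascent, implemented by minimizing the negated distance).
    mask = mask_module(x)  # assumed: soft mask in [0, 1], same shape as x
    mask_loss = -masked_distance(encoder, x, mask)
    mask_opt.zero_grad(); mask_loss.backward(); mask_opt.step()

    # 2) Encoder: minimize the same distance under the (now fixed) mask.
    mask = mask_module(x).detach()
    enc_loss = masked_distance(encoder, x, mask)
    enc_opt.zero_grad(); enc_loss.backward(); enc_opt.step()
    return enc_loss.item()
```

In the paper the masking is applied to both the image and the text branch in a unified manner; the sketch treats the input generically and omits the answer head and any additional contrastive terms.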
Related papers
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Analyzing the Effect of $k$-Space Features in MRI Classification Models [0.0]
We have developed an explainable AI methodology tailored for medical imaging.
We employ a Convolutional Neural Network (CNN) that analyzes MRI scans across both image and frequency domains.
This approach not only enhances early training efficiency but also deepens our understanding of how additional features impact the model predictions.
arXiv Detail & Related papers (2024-09-20T15:43:26Z)
- CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models [51.70129969269271]
We introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE).
Our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs.
arXiv Detail & Related papers (2024-06-04T03:04:21Z)
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore MAEs for zero-shot learning across domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [24.215619918283462]
We present a novel framework for learning medical visual representations directly from paired radiology reports.
Our framework harnesses the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels.
arXiv Detail & Related papers (2022-10-12T09:31:39Z)
- Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
arXiv Detail & Related papers (2020-12-06T20:30:26Z)