Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval
- URL: http://arxiv.org/abs/2312.15840v2
- Date: Wed, 27 Dec 2023 03:00:10 GMT
- Title: Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval
- Authors: Zeqiang Wei, Kai Jin, Xiuzhuang Zhou
- Abstract summary: The cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks.
We propose an efficient framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks.
This enhances task connections, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cross-modal medical image-report retrieval task plays a significant role
in clinical diagnosis and various medical generative tasks. Eliminating
heterogeneity between different modalities to enhance semantic consistency is
the key challenge of this task. The current Vision-Language Pretraining (VLP)
models, with cross-modal contrastive learning and masked reconstruction as
joint training tasks, can effectively enhance the performance of cross-modal
retrieval. This framework typically employs dual-stream inputs, using unmasked
data for cross-modal contrastive learning and masked data for reconstruction.
However, due to task competition and information interference caused by
significant differences between the inputs of the two proxy tasks, the
effectiveness of representation learning for intra-modal and cross-modal
features is limited. In this paper, we propose an efficient VLP framework named
Masked Contrastive and Reconstruction (MCR), which takes masked data as the
sole input for both tasks. This enhances task connections, reducing information
interference and competition between them, while also substantially decreasing
the required GPU memory and training time. Moreover, we introduce a new
modality alignment strategy named Mapping before Aggregation (MbA). Unlike
previous methods, MbA maps different modalities to a common feature space
before conducting local feature aggregation, thereby reducing the loss of
fine-grained semantic information necessary for improved modality alignment.
Qualitative and quantitative experiments conducted on the MIMIC-CXR dataset
validate the effectiveness of our approach, demonstrating state-of-the-art
performance in medical cross-modal retrieval tasks.
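To make the single-input idea concrete, below is a minimal sketch of an MCR-style training step in PyTorch. Every name here (mcr_step, the encoder/decoder modules, the equal loss weighting) is an illustrative assumption rather than the authors' implementation; what it demonstrates is that the masked views are the sole input, so one forward pass per modality serves both the contrastive and the reconstruction objectives.

```python
import torch
import torch.nn.functional as F

def mcr_step(img_enc, txt_enc, img_dec, txt_dec,
             masked_imgs, masked_txts, img_targets, txt_targets, tau=0.07):
    """Hypothetical MCR-style step: masked data is the sole input to BOTH tasks."""
    # A single forward pass per modality, on masked data only; skipping the
    # second (unmasked) stream is what saves GPU memory and training time.
    v_tok, v_cls = img_enc(masked_imgs)   # (B, N, D) patch tokens, (B, D) pooled
    t_tok, t_cls = txt_enc(masked_txts)   # (B, L, D) word tokens,  (B, D) pooled

    # (1) Cross-modal contrastive loss (symmetric InfoNCE) computed directly
    #     on the pooled features of the *masked* inputs.
    v = F.normalize(v_cls, dim=-1)
    t = F.normalize(t_cls, dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

    # (2) Reconstruction losses reuse the very same masked-token features:
    #     pixel regression for images, token prediction for report text.
    loss_rec = F.mse_loss(img_dec(v_tok), img_targets) + \
               F.cross_entropy(txt_dec(t_tok).transpose(1, 2), txt_targets)

    return loss_con + loss_rec
```

Because only the masked view is ever encoded, both proxy tasks operate on the same representation, which is the tighter task connection (and the memory and time saving) the abstract describes.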
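The MbA alignment strategy admits a similar sketch. Assuming each modality yields token-level local features, the projection into the common space is applied before aggregation; the module below is hypothetical, with mean pooling standing in for whichever local aggregation is actually used.

```python
import torch.nn as nn
import torch.nn.functional as F

class MapBeforeAggregate(nn.Module):
    """Hypothetical MbA-style head: project local features into the shared
    space first, then aggregate, so fine-grained semantics are aligned
    before pooling can discard them."""
    def __init__(self, dim_in, dim_shared):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_shared)  # per-modality mapping

    def forward(self, local_feats):                # (B, N, dim_in)
        mapped = self.proj(local_feats)            # map each local token first
        pooled = mapped.mean(dim=1)                # aggregate in the shared space
        return F.normalize(pooled, dim=-1)         # (B, dim_shared)
```

The conventional order (aggregate, then map) would pool raw modality-specific features first, mixing fine-grained information away before it can be aligned; mapping first preserves it, as the abstract argues.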
Related papers
- ICH-SCNet: Intracerebral Hemorrhage Segmentation and Prognosis Classification Network Using CLIP-guided SAM mechanism
Intracerebral hemorrhage (ICH) is the most fatal subtype of stroke and is characterized by a high incidence of disability.
Existing approaches address these two tasks independently and predominantly focus on imaging data alone.
This paper introduces a multi-task network, ICH-SCNet, designed for both ICH segmentation and prognosis classification.
arXiv Detail & Related papers (2024-11-07T12:34:25Z)
- Adaptive Affinity-Based Generalization For MRI Imaging Segmentation Across Resource-Limited Settings
This paper introduces a novel relation-based knowledge framework by seamlessly combining adaptive affinity-based and kernel-based distillation.
To validate our innovative approach, we conducted experiments on publicly available multi-source prostate MRI data.
arXiv Detail & Related papers (2024-04-03T13:35:51Z)
- Complementary Information Mutual Learning for Multimodality Medical Image Segmentation
This paper presents the Complementary Information Mutual Learning (CIML) framework, which can mathematically model and address the negative impact of inter-modal redundant information.
Numerical results indicate that CIML efficiently removes redundant information between modalities, outperforming SOTA methods regarding validation accuracy and segmentation effect.
arXiv Detail & Related papers (2024-01-05T09:21:45Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment.
The scarcity of annotated data limits the effectiveness and generalization of existing methods.
We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification
Text-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning
Cross-modal representation learning and fine-grained feature discrimination are key to effective video representations.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Modality Compensation Network: Cross-Modal Adaptation for Action Recognition
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
arXiv Detail & Related papers (2020-01-31T04:51:55Z)