Multi-task Paired Masking with Alignment Modeling for Medical
Vision-Language Pre-training
- URL: http://arxiv.org/abs/2305.07920v3
- Date: Sun, 22 Oct 2023 13:06:45 GMT
- Title: Multi-task Paired Masking with Alignment Modeling for Medical
Vision-Language Pre-training
- Authors: Ke Zhang, Yan Yang, Jun Yu, Hanliang Jiang, Jianping Fan, Qingming
Huang and Weidong Han
- Abstract summary: We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
- Score: 55.56609500764344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the growing demand for medical imaging diagnosis has placed
a significant burden on radiologists. As a solution, Medical Vision-Language
Pre-training (Med-VLP) methods have been proposed to learn universal
representations from medical images and reports, benefiting downstream tasks
without requiring fine-grained annotations. However, existing methods have
overlooked the importance of cross-modal alignment in joint image-text
reconstruction, resulting in insufficient cross-modal interaction. To address
this limitation, we propose a unified Med-VLP framework based on Multi-task
Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment
task into the joint image-text reconstruction framework, achieving more
comprehensive cross-modal interaction. A Global and Local Alignment (GLA)
module is designed to assist the self-supervised paradigm in obtaining semantic
representations with rich domain knowledge. Furthermore, we introduce a
Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual
information to assist report reconstruction and fuse the multi-modal
representations adequately. Experimental results demonstrate that the proposed
unified approach outperforms previous methods in all downstream tasks,
including uni-modal, cross-modal, and multi-modal tasks.
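The toy PyTorch sketch below illustrates the general idea described in the abstract: mask image patches and report tokens, reconstruct both modalities, and add a global contrastive alignment term. All names, shapes, and design choices here (ToyPairedMasking, the linear encoders/decoders, the mask ratio, the 0.07 temperature) are illustrative assumptions and do not reproduce the actual MPMA, GLA, or MA-CMF modules.

```python
# Illustrative sketch only: a toy "paired masking + global alignment" objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPairedMasking(nn.Module):
    def __init__(self, patch_dim=768, vocab_size=30522, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.img_enc = nn.Linear(patch_dim, dim)        # stand-in image encoder
        self.txt_emb = nn.Embedding(vocab_size, dim)    # stand-in text encoder
        self.img_dec = nn.Linear(dim, patch_dim)        # reconstruct raw patches
        self.txt_dec = nn.Linear(dim, vocab_size)       # reconstruct token ids
        self.mask_img = nn.Parameter(torch.zeros(dim))  # learned [MASK] embeddings
        self.mask_txt = nn.Parameter(torch.zeros(dim))

    def random_mask(self, n, device):
        return torch.rand(n, device=device) < self.mask_ratio

    def forward(self, patches, tokens):
        # patches: (B, Np, patch_dim) float, tokens: (B, Nt) integer ids
        B = patches.shape[0]
        v = self.img_enc(patches)          # (B, Np, dim)
        t = self.txt_emb(tokens)           # (B, Nt, dim)

        # Paired masking: hide a fraction of patches and tokens jointly.
        m_v = self.random_mask(patches.shape[1], patches.device)
        m_t = self.random_mask(tokens.shape[1], patches.device)
        v_masked = torch.where(m_v[None, :, None], self.mask_img, v)
        t_masked = torch.where(m_t[None, :, None], self.mask_txt, t)

        # Reconstruction losses on the masked positions only.
        img_rec = F.mse_loss(self.img_dec(v_masked)[:, m_v], patches[:, m_v])
        txt_rec = F.cross_entropy(
            self.txt_dec(t_masked)[:, m_t].reshape(-1, self.txt_dec.out_features),
            tokens[:, m_t].reshape(-1),
        )

        # Global alignment: symmetric InfoNCE loss over pooled features
        # (0.07 is an assumed temperature).
        g_v = F.normalize(v.mean(dim=1), dim=-1)
        g_t = F.normalize(t.mean(dim=1), dim=-1)
        logits = g_v @ g_t.t() / 0.07
        labels = torch.arange(B, device=patches.device)
        align = (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.t(), labels)) / 2

        return img_rec + txt_rec + align
```

For example, `ToyPairedMasking()(torch.randn(2, 196, 768), torch.randint(0, 30522, (2, 64)))` returns a single scalar training loss combining the two reconstruction terms and the alignment term.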
Related papers
- MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report [4.340464264725625]
We introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs) and radiology/cardiology reports.
We utilize LoRA-PEFT to significantly reduce the trainable parameters in the LLM and incorporate a recent linear attention-dropping strategy in the Vision Transformer (ViT) for smoother attention.
To the best of our knowledge, we are the first to propose an integrated model that combines X-rays, ECGs, and radiology/cardiology reports in this way.
arXiv Detail & Related papers (2024-10-21T17:42:41Z)
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: a Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval [3.5314225883644945]
The cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks.
We propose an efficient framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks.
This enhances task connections, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time.
arXiv Detail & Related papers (2023-12-26T01:14:10Z)
- UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts [14.681493967465693]
We propose UniDCP, a Unified medical vision-language model with Dynamic Cross-modal learnable Prompts.
UniDCP is capable of performing all 8 medical uni-modal and cross-modal tasks over 14 corresponding datasets.
arXiv Detail & Related papers (2023-12-18T13:18:24Z)
- CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare [16.033112094191395]
We introduce the Multimodal Medical Question Summarization (MMQS) dataset.
This dataset pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs.
We also propose a framework consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries.
arXiv Detail & Related papers (2023-12-16T03:02:05Z)
- C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose cross-modal consistent multi-view medical report generation with a domain transfer network (C2M-DoT).
C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z)
- Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
- Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [24.215619918283462]
We present a novel framework for learning medical visual representations directly from paired radiology reports.
Our framework harnesses the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels.
arXiv Detail & Related papers (2022-10-12T09:31:39Z)
- Robust Multimodal Brain Tumor Segmentation via Feature Disentanglement and Gated Fusion [71.87627318863612]
We propose a novel multimodal segmentation framework which is robust to the absence of imaging modalities.
Our network uses feature disentanglement to decompose the input modalities into modality-specific appearance codes.
We validate our method on the important yet challenging multimodal brain tumor segmentation task with the BRATS challenge dataset.
arXiv Detail & Related papers (2020-02-22T14:32:04Z)
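As a rough illustration of the gated-fusion idea named in the last entry above (fusing multimodal features while tolerating missing modalities), the sketch below weights per-modality feature maps with learned gates and renormalizes the weights over the modalities that are actually present. The module names, shapes, and gating scheme are assumptions made for illustration, not the paper's architecture.

```python
# Illustrative sketch only: gated fusion of modality-specific features that
# handles missing modalities by renormalizing gate weights over available ones.
import torch
import torch.nn as nn


class ToyGatedFusion(nn.Module):
    def __init__(self, num_modalities=4, channels=32):
        super().__init__()
        # One scalar gate per modality, predicted from its own feature map.
        self.gates = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(channels, 1))
            for _ in range(num_modalities)
        ])

    def forward(self, feats, present):
        # feats: list of (B, C, H, W) feature maps, one per modality
        # present: (B, M) boolean mask of available modalities
        # (assumes at least one modality is present per sample)
        scores = torch.stack(
            [g(f).squeeze(-1) for g, f in zip(self.gates, feats)], dim=1)  # (B, M)
        scores = scores.masked_fill(~present, float("-inf"))
        weights = torch.softmax(scores, dim=1)   # absent modalities get weight 0
        fused = sum(w[:, None, None, None] * f
                    for w, f in zip(weights.unbind(dim=1), feats))
        return fused                             # (B, C, H, W)
```

Calling it with a list of four (B, 32, H, W) feature maps and a (B, 4) availability mask returns one fused (B, 32, H, W) map; a modality marked absent simply drops out of the weighted sum.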