Multi-Granularity Cross-modal Alignment for Generalized Medical Visual
Representation Learning
- URL: http://arxiv.org/abs/2210.06044v1
- Date: Wed, 12 Oct 2022 09:31:39 GMT
- Title: Multi-Granularity Cross-modal Alignment for Generalized Medical Visual
Representation Learning
- Authors: Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, Lequan Yu
- Abstract summary: We present a novel framework for learning medical visual representations directly from paired radiology reports.
Our framework harnesses the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels.
- Score: 24.215619918283462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning medical visual representations directly from paired radiology
reports has become an emerging topic in representation learning. However,
existing medical image-text joint learning methods rely only on instance-level or
local supervision, ignoring disease-level semantic correspondences. In
this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA)
framework for generalized medical visual representation learning by harnessing
the naturally exhibited semantic correspondences between medical images and
radiology reports at three different levels, i.e., pathological region-level,
instance-level, and disease-level. Specifically, we first incorporate an
instance-wise alignment module that maximizes the agreement between paired
images and reports. Further, for token-wise alignment, we introduce a bidirectional
cross-attention strategy to explicitly learn the matching between fine-grained
visual tokens and text tokens, followed by contrastive learning to align them.
More importantly, to leverage high-level inter-subject semantic correspondences
(e.g., shared diseases), we design a novel cross-modal disease-level alignment
paradigm that enforces cross-modal cluster assignment consistency.
Extensive experimental results on seven downstream medical image datasets
covering image classification, object detection, and semantic segmentation
tasks demonstrate the stable and superior performance of our framework.
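Below is a minimal PyTorch sketch of the three alignment objectives, assuming generic image and text encoders that yield one global embedding plus a sequence of token embeddings per sample. All names, dimensions, the number of disease prototypes, and the plain-softmax cluster codes are illustrative assumptions rather than the authors' exact implementation; the paper's disease-level codes and loss weighting may be computed differently.

```python
# Illustrative sketch of multi-granularity cross-modal alignment losses.
# Encoder outputs, dimensions, and prototype count are assumptions for this example.
import torch
import torch.nn.functional as F


def instance_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Instance-level: symmetric InfoNCE pulling each image toward its paired report."""
    img_emb = F.normalize(img_emb, dim=-1)            # (B, D) global image embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)            # (B, D) global report embeddings
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) cross-modal similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def token_alignment_loss(img_tokens, txt_tokens, temperature=0.07):
    """Token-level: bidirectional cross-attention matches fine-grained visual and text
    tokens; here each token is simply pulled toward its cross-modal summary (a cosine
    agreement term standing in for the full token-wise contrastive loss)."""
    img_tokens = F.normalize(img_tokens, dim=-1)      # (B, Nv, D) visual (patch) tokens
    txt_tokens = F.normalize(txt_tokens, dim=-1)      # (B, Nt, D) report (word) tokens
    attn_t2v = torch.softmax(txt_tokens @ img_tokens.transpose(1, 2) / temperature, dim=-1)
    txt_side = attn_t2v @ img_tokens                  # (B, Nt, D) visual summary per word
    attn_v2t = torch.softmax(img_tokens @ txt_tokens.transpose(1, 2) / temperature, dim=-1)
    img_side = attn_v2t @ txt_tokens                  # (B, Nv, D) textual summary per patch
    loss_t = 1 - F.cosine_similarity(txt_tokens, txt_side, dim=-1).mean()
    loss_v = 1 - F.cosine_similarity(img_tokens, img_side, dim=-1).mean()
    return 0.5 * (loss_t + loss_v)


def disease_alignment_loss(img_emb, txt_emb, prototypes, temperature=0.1):
    """Disease-level: both modalities are softly assigned to shared prototypes (cluster
    centres); each modality's assignment must be predictable from the other, enforcing
    cross-modal cluster assignment consistency."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    protos = F.normalize(prototypes, dim=-1)          # (K, D) learnable disease prototypes
    img_scores = img_emb @ protos.t() / temperature   # (B, K)
    txt_scores = txt_emb @ protos.t() / temperature   # (B, K)
    img_codes = torch.softmax(img_scores, dim=-1).detach()   # targets from the image side
    txt_codes = torch.softmax(txt_scores, dim=-1).detach()   # targets from the text side
    loss_i = -(txt_codes * F.log_softmax(img_scores, dim=-1)).sum(dim=-1).mean()
    loss_t = -(img_codes * F.log_softmax(txt_scores, dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i + loss_t)
```

In training, the three terms would be summed (optionally with per-level weights) on top of the image and text encoders; `prototypes` stands for a hypothetical learnable (K, D) parameter shared by both modalities.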
Related papers
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with the impression section and token-level visual representations with the findings section.
Our framework incorporates a generation decoder that employs two proxy tasks: (1) generating the impression from images via a captioning branch, and (2) generating the findings via a summarization branch.
Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- See Detail Say Clear: Towards Brain CT Report Generation via Pathological Clue-driven Representation Learning [12.40415847810958]
We introduce a Pathological Clue-driven Representation Learning (PCRL) model to build cross-modal representations based on pathological clues.
Specifically, we construct pathological clues from perspectives of segmented regions, pathological entities, and report themes.
To adapt the representations for the text generation task, we bridge the gap between representation learning and report generation by using a unified large language model (LLM) with task-tailored instructions.
arXiv Detail & Related papers (2024-09-29T12:08:20Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective via forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We evaluate downstream image classification and image-text retrieval tasks on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Anatomical Structure-Guided Medical Vision-Language Pre-training [21.68719061251635]
We propose an Anatomical Structure-Guided (ASG) framework for learning medical visual representations.
For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists.
For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample.
arXiv Detail & Related papers (2024-03-14T11:29:47Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- Enhancing medical vision-language contrastive learning via inter-matching relation modelling [14.777259981193726]
Medical image representations can be learned through medical vision-language contrastive learning (mVLCL).
Recent mVLCL methods attempt to align image sub-regions with report keywords as local-matchings.
We propose an mVLCL method that models the inter-matching relations between local-matchings via a relation-enhanced contrastive learning framework (RECLF).
arXiv Detail & Related papers (2024-01-19T05:28:51Z)
- C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation framework with a domain transfer network (C^2M-DoT).
C^2M-DoT substantially outperforms state-of-the-art baselines on all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- Cross-level Contrastive Learning and Consistency Constraint for Semi-supervised Medical Image Segmentation [46.678279106837294]
We propose a cross-level contrastive learning scheme to enhance representation capacity for local features in semi-supervised medical image segmentation.
With the help of the cross-level contrastive learning and consistency constraint, the unlabelled data can be effectively explored to improve segmentation performance.
arXiv Detail & Related papers (2022-02-08T15:12:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.