Multi-View and Multi-Scale Alignment for Contrastive Language-Image
Pre-training in Mammography
- URL: http://arxiv.org/abs/2409.18119v1
- Date: Thu, 26 Sep 2024 17:56:59 GMT
- Title: Multi-View and Multi-Scale Alignment for Contrastive Language-Image
Pre-training in Mammography
- Authors: Yuexi Du, John Onofrey, Nicha C. Dvornek
- Abstract summary: Contrastive Language-Image Pre-training shows promise in medical image analysis but requires substantial data and computational resources.
Here, we propose the first adaptation of the full CLIP model to mammography.
- Score: 4.500815515502233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) shows promise in medical image
analysis but requires substantial data and computational resources. Due to
these restrictions, existing CLIP applications in medical imaging focus mainly
on modalities like chest X-rays that have abundant image-report data available,
leaving many other important modalities under-explored. Here, we propose the
first adaptation of the full CLIP model to mammography, which presents
significant challenges due to labeled data scarcity, high-resolution images
with small regions of interest, and data imbalance. We first develop a
specialized supervision framework for mammography that leverages its multi-view
nature. Furthermore, we design a symmetric local alignment module to better
focus on detailed features in high-resolution images. Lastly, we incorporate a
parameter-efficient fine-tuning approach for large language models pre-trained
with medical knowledge to address data limitations. Our multi-view and
multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for
three different tasks on two large real-world mammography datasets, EMBED and
RSNA-Mammo, with only 52% of the model size of the largest baseline.
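The paper's code is not reproduced on this page, but the CLIP-style objective that MaMA extends with multi-view supervision is standard. Below is a minimal sketch of a symmetric image-text contrastive (InfoNCE) loss of that kind; the embedding dimension, temperature, and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) tensors from the image and text encoders.
    Matching pairs share the same row index.
    """
    # Cosine similarity via L2-normalized embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch)

    # The i-th image matches the i-th text; every other pair is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: 8 image/report pairs with 256-dim embeddings.
loss = clip_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```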
Related papers
- Brain-Adapter: Enhancing Neurological Disorder Analysis with Adapter-Tuning Multimodal Large Language Models [30.044545011553172]
This paper proposes Brain-Adapter, a novel approach that incorporates an extra bottleneck layer to learn new knowledge and instill it into the original pre-trained knowledge.
Experiments demonstrated the effectiveness of our approach in integrating multimodal data to significantly improve the diagnosis accuracy without high computational costs.
arXiv Detail & Related papers (2025-01-27T18:20:49Z)
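Brain-Adapter's "extra bottleneck layer" follows the widely used adapter pattern: a small down-project/up-project block added to a frozen pre-trained model. The sketch below shows that generic pattern; the hidden sizes and residual placement are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic adapter: down-project, nonlinearity, up-project, residual add.

    Only these few parameters are trained; the host model stays frozen,
    which keeps tuning cheap relative to full fine-tuning.
    """
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        # Near-zero init so the adapter starts as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Toy usage: adapt a (batch, tokens, dim) transformer activation.
h = torch.randn(2, 16, 768)
print(BottleneckAdapter()(h).shape)  # torch.Size([2, 16, 768])
```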
- UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks.
UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs.
We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
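The zero-shot evaluations mentioned above follow the usual CLIP recipe: encode one text prompt per class and pick the class whose embedding is most similar to the image's. A minimal sketch, with placeholder embeddings standing in for UniMed-CLIP's actual encoders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor,
                       class_text_embs: torch.Tensor) -> torch.Tensor:
    """Return the index of the class prompt most similar to each image.

    image_emb:       (batch, dim) image embeddings.
    class_text_embs: (num_classes, dim) embeddings of prompts such as
                     "a mammogram showing <finding>".
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    similarity = image_emb @ class_text_embs.t()  # (batch, num_classes)
    return similarity.argmax(dim=-1)

# Toy usage: 4 images scored against 3 class prompts in a 256-dim space.
preds = zero_shot_classify(torch.randn(4, 256), torch.randn(3, 256))
```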
- Discriminative Hamiltonian Variational Autoencoder for Accurate Tumor Segmentation in Data-Scarce Regimes [2.8498944632323755]
We propose an end-to-end hybrid architecture for medical image segmentation.
We use Hamiltonian Variational Autoencoders (HVAE) and a discriminative regularization to improve the quality of generated images.
Our architecture operates on a slice-by-slice basis to segment 3D volumes, capitalizing on the richly augmented dataset.
arXiv Detail & Related papers (2024-06-17T15:42:08Z)
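"Operating on a slice-by-slice basis" means running a 2D segmentation network over each axial slice and restacking the outputs into a volume. A minimal sketch of that loop, with a placeholder 2D model standing in for the paper's HVAE-based segmenter:

```python
import torch
import torch.nn as nn

def segment_volume(model: nn.Module, volume: torch.Tensor) -> torch.Tensor:
    """Segment a 3D volume by applying a 2D model to each slice.

    volume: (depth, H, W) tensor; returns (depth, H, W) mask logits.
    """
    masks = []
    for z in range(volume.shape[0]):
        slice_2d = volume[z].unsqueeze(0).unsqueeze(0)  # (1, 1, H, W)
        masks.append(model(slice_2d).squeeze(0).squeeze(0))
    return torch.stack(masks, dim=0)

# Placeholder 2D segmenter: a single conv layer, purely illustrative.
toy_model = nn.Conv2d(1, 1, kernel_size=3, padding=1)
logits = segment_volume(toy_model, torch.randn(32, 64, 64))
```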
- Inter-slice Super-resolution of Magnetic Resonance Images by Pre-training and Self-supervised Fine-tuning [49.197385954021456]
In clinical practice, 2D magnetic resonance (MR) sequences are widely adopted. While individual 2D slices can be stacked to form a 3D volume, the relatively large slice spacing can pose challenges for visualization and subsequent analysis tasks.
To reduce slice spacing, deep-learning-based super-resolution techniques are widely investigated.
Most current solutions require a substantial number of paired high-resolution and low-resolution images for supervised training, which are typically unavailable in real-world scenarios.
arXiv Detail & Related papers (2024-06-10T02:20:26Z)
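When paired high/low-resolution volumes are unavailable, a common self-supervised workaround, and the spirit of the fine-tuning stage here, is to subsample a volume along the slice axis and train the model to restore the dropped slices. A sketch of that data construction under assumed tensor shapes:

```python
import torch

def make_self_supervised_pair(volume: torch.Tensor, factor: int = 2):
    """Build a (low-res, target) training pair from a single volume.

    volume: (depth, H, W). Keeping every `factor`-th slice simulates a
    large slice spacing; the withheld slices become the supervision target.
    """
    low_res = volume[::factor]  # simulated sparse acquisition
    target = volume             # dense volume the model must recover
    return low_res, target

vol = torch.randn(32, 64, 64)
low, tgt = make_self_supervised_pair(vol)  # (16, 64, 64), (32, 64, 64)
```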
- Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography [12.159236541184754]
Mammo-CLIP is the first VLM pre-trained on a substantial amount of screening mammogram-report pairs.
Experiments on two public datasets demonstrate strong performance in classifying and localizing various mammographic attributes.
arXiv Detail & Related papers (2024-05-20T08:27:39Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that, for large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art results.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
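The "careful mix" of data types amounts to sampling each training batch from the three sources with fixed ratios. A minimal sketch of such a mixture sampler; the weights below are arbitrary placeholders, not MM1's reported recipe:

```python
import random

def sample_batch(sources: dict, weights: dict, batch_size: int) -> list:
    """Draw a batch from named datasets according to mixture weights.

    sources: name -> list of examples; weights: name -> sampling weight.
    """
    names = list(sources)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        src = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(sources[src]))
    return batch

# Placeholder corpora and weights (illustrative only).
data = {"caption": ["cap1", "cap2"], "interleaved": ["doc1"], "text": ["t1"]}
mix = {"caption": 0.45, "interleaved": 0.45, "text": 0.10}
print(sample_batch(data, mix, batch_size=8))
```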
- End-to-end autoencoding architecture for the simultaneous generation of medical images and corresponding segmentation masks [3.1133049660590615]
We present an end-to-end architecture based on the Hamiltonian Variational Autoencoder (HVAE).
This approach yields an improved posterior distribution approximation compared to traditional Variational Autoencoders (VAE).
Our method outperforms generative adversarial approaches, showcasing enhancements in synthesized image quality.
arXiv Detail & Related papers (2023-11-17T11:56:53Z)
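Simultaneous generation of an image and its segmentation mask is typically done by decoding one latent code through two output heads. A minimal sketch of that decoder shape, using a plain VAE-style decoder rather than the paper's Hamiltonian variant:

```python
import torch
import torch.nn as nn

class TwoHeadDecoder(nn.Module):
    """Decode one latent vector into an image and a matching mask."""
    def __init__(self, latent_dim: int = 64, out_hw: int = 32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU())
        self.image_head = nn.Linear(256, out_hw * out_hw)  # grayscale image
        self.mask_head = nn.Linear(256, out_hw * out_hw)   # mask logits
        self.out_hw = out_hw

    def forward(self, z: torch.Tensor):
        h = self.shared(z)
        img = torch.sigmoid(self.image_head(h))
        mask = self.mask_head(h)  # apply sigmoid + threshold at use time
        s = self.out_hw
        return img.view(-1, 1, s, s), mask.view(-1, 1, s, s)

img, mask = TwoHeadDecoder()(torch.randn(4, 64))
```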
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
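The pre-training signal described above is a denoising objective: corrupt an image with patch masking plus a low-level perturbation, then train the network to reconstruct the original. A sketch of one plausible corruption step; the patch size and noise scale are assumed values:

```python
import torch

def disrupt(image: torch.Tensor, patch: int = 8, mask_ratio: float = 0.5,
            noise_std: float = 0.1) -> torch.Tensor:
    """Corrupt an image with local patch masking plus additive noise.

    image: (C, H, W) with H, W divisible by `patch`. The pre-training
    target is the uncorrupted input, so no labels are needed.
    """
    c, h, w = image.shape
    out = image + noise_std * torch.randn_like(image)  # low-level perturbation
    # Zero out a random subset of non-overlapping patches (local masking).
    keep = torch.rand(h // patch, w // patch) > mask_ratio
    mask = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    return out * mask  # masked regions become zero

corrupted = disrupt(torch.randn(1, 64, 64))
# A reconstruction loss, e.g. F.mse_loss(model(corrupted), original),
# then drives pre-training.
```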
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- PCRLv2: A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis [56.63327669853693]
We propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics.
We also address the preservation of scale information, a powerful tool in aiding image understanding.
The proposed unified SSL framework surpasses its self-supervised counterparts on various tasks.
arXiv Detail & Related papers (2023-01-02T17:47:27Z)
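Combining pixel-level restoration with high-level semantics, as the PCRLv2 entry above describes, usually means summing a reconstruction loss with a contrastive or other semantic loss on the same encoder. A schematic sketch of that combined objective, with placeholder networks and an assumed weighting:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))  # placeholder
decoder = nn.Sequential(nn.Linear(128, 32 * 32))                # placeholder

def pcrl_style_loss(view_a: torch.Tensor, view_b: torch.Tensor,
                    weight: float = 1.0) -> torch.Tensor:
    """Pixel restoration + semantic agreement on two views of one image.

    view_a, view_b: (batch, 1, 32, 32) augmented crops of the same images.
    """
    za, zb = encoder(view_a), encoder(view_b)
    # Pixel branch: restore view_a from its own embedding.
    recon = decoder(za).view_as(view_a)
    pixel_loss = F.mse_loss(recon, view_a)
    # Semantic branch: embeddings of the two views should agree.
    semantic_loss = 1 - F.cosine_similarity(za, zb, dim=-1).mean()
    return pixel_loss + weight * semantic_loss

x = torch.randn(4, 1, 32, 32)
loss = pcrl_style_loss(x, x + 0.05 * torch.randn_like(x))
```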
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.