M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
- URL: http://arxiv.org/abs/2509.01360v1
- Date: Mon, 01 Sep 2025 10:59:39 GMT
- Title: M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
- Authors: Che Liu, Zheng Jiang, Chengyu Fang, Heng Guo, Yan-Jie Zhou, Jiaqi Qu, Le Lu, Minfeng Xu,
- Abstract summary: We train M3Ret, a unified visual encoder, without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP.
- Score: 24.846428105192405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image retrieval is essential for clinical decision-making and translational research, and it depends on discriminative visual representations. Yet current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder, without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks despite never observing MRI during pretraining, demonstrating that purely visual self-supervision transfers to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
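In practice, zero-shot image-to-image retrieval with a unified encoder reduces to nearest-neighbor search over L2-normalized embeddings. The following is a minimal sketch of that pipeline, assuming a pretrained `encoder` that maps a batch of images to fixed-size feature vectors; it illustrates the retrieval setup, not M3Ret's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(encoder, query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Return the indices of the k gallery items most similar to each query.

    query:   (Q, ...) batch of query images in whatever layout the
             unified encoder accepts (2D, 3D, or video inputs).
    gallery: (G, ...) candidate images, possibly from other modalities.
    """
    q = F.normalize(encoder(query), dim=-1)    # (Q, D) unit-norm embeddings
    g = F.normalize(encoder(gallery), dim=-1)  # (G, D)
    sim = q @ g.T                              # cosine similarity, (Q, G)
    return sim.topk(k, dim=-1).indices         # (Q, k) retrieved indices
```

Because a single encoder embeds every modality into one space, the same function serves both within-modality and cross-modal retrieval.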
Related papers
- Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification [69.87877580725768]
Multimodal Visual Surrogate Compression (MVSC) learns to compress and adapt large 3D sMRI volumes into compact 2D features. MVSC has two key components: a Volume Context module that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner.
arXiv Detail & Related papers (2026-01-29T13:05:46Z)
- Towards Generalisable Foundation Models for 3D Brain MRI [5.527537739064968]
We introduce BrainFound, a self-supervised foundation model for brain MRI built by extending DINO-v2. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating information from sequential MRI slices. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation.
arXiv Detail & Related papers (2025-10-27T15:19:46Z)
- Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement [15.28003304776022]
In-context learning offers a promising paradigm for universal medical image analysis. We present Medverse, a universal ICL model for 3D medical imaging trained on 22 datasets. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine.
arXiv Detail & Related papers (2025-09-11T08:10:49Z)
- Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities [0.0]
BM-MAE is a masked image modeling pre-training strategy tailored for multimodal MRI data. It seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. It can quickly and efficiently reconstruct missing modalities, highlighting its practical value.
arXiv Detail & Related papers (2025-05-01T14:51:30Z)
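The central idea, adapting to any combination of available modalities, can be pictured with a toy modality-dropping step; the tensor layout, the 50% drop rate, and the helper name below are assumptions for illustration, not BM-MAE's actual implementation.

```python
import torch

def random_modality_subset(x: torch.Tensor):
    """x: (B, M, D, H, W) volume stacking M MRI modalities (e.g. T1, T1ce, T2, FLAIR).

    Keeps a random non-empty subset of modalities per sample, so the encoder
    must extract intra- and inter-modal information from whatever is present.
    Returns the masked input and the boolean keep-mask.
    """
    B, M = x.shape[:2]
    keep = torch.rand(B, M, device=x.device) < 0.5            # drop each modality w.p. 0.5
    keep[torch.arange(B), torch.randint(0, M, (B,))] = True   # guarantee one survivor
    return x * keep[:, :, None, None, None].float(), keep
```

A standard masked-autoencoder reconstruction loss over the hidden modalities and masked patches then supplies the pre-training signal.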
- Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning [0.15749416770494706]
Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI. We present a sequence-invariant self-supervised framework leveraging quantitative MRI (qMRI). Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches.
arXiv Detail & Related papers (2025-01-21T11:27:54Z)
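One way to realize sequence invariance is to simulate two different pulse-sequence contrasts from the same subject's qMRI maps and treat them as a positive pair under a standard InfoNCE loss. The sketch below assumes placeholder `encoder` and `simulate_contrast` callables; it paraphrases the idea rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE between two (B, D) views; matching rows are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                              # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # i-th query matches i-th key
    return F.cross_entropy(logits, targets)

def training_step(encoder, simulate_contrast, qmri_maps):
    v1 = simulate_contrast(qmri_maps)  # e.g. a randomly parameterized T1-weighted view
    v2 = simulate_contrast(qmri_maps)  # a second random contrast of the same anatomy
    return info_nce(encoder(v1), encoder(v2))
```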
- MRGen: Segmentation Data Engine for Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically important imaging modalities is challenging due to the scarcity of annotated data. This paper investigates leveraging generative models to synthesize data for training segmentation models for underrepresented modalities. We present MRGen, a data engine for controllable medical image synthesis conditioned on text prompts and segmentation masks.
arXiv Detail & Related papers (2024-12-04T16:34:22Z)
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
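Once scans are serialized into visual tokens, pre-training becomes plain next-token prediction, as in language modeling. A minimal sketch, assuming a causal `transformer` that returns per-position vocabulary logits; the tokenizer that orders scans by spatial, contrast, and semantic correlation is not reproduced here:

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(transformer, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (B, L) discrete visual-token ids for one ordered scan sequence."""
    logits = transformer(tokens[:, :-1])      # (B, L-1, V) causal predictions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        tokens[:, 1:].reshape(-1),            # targets are the inputs shifted by one
    )
```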
- Learning Brain Tumor Representation in 3D High-Resolution MR Images via Interpretable State Space Models [42.55786269051626]
We propose a novel state-space-model (SSM)-based masked autoencoder which scales ViT-like models to handle high-resolution data effectively.
We propose a latent-to-spatial mapping technique that enables direct visualization of how latent features correspond to specific regions in the input volumes.
Our results highlight the potential of SSM-based self-supervised learning to transform radiomics analysis by combining efficiency and interpretability.
arXiv Detail & Related papers (2024-09-12T04:36:50Z)
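The latent-to-spatial mapping can be pictured as reshaping patch-token features back onto the 3D patch grid and upsampling, so any latent channel can be viewed as a heatmap over the input volume. A rough sketch of that visualization idea; the grid and output sizes are invented for illustration and this is not the paper's technique verbatim:

```python
import torch
import torch.nn.functional as F

def latent_heatmap(latents: torch.Tensor, grid=(8, 8, 8), out=(128, 128, 128)):
    """latents: (B, N, D) patch-token features with N == prod(grid).

    Returns a (B, D, *out) volume whose channels localize each latent
    dimension in input space for visualization.
    """
    B, N, D = latents.shape
    vol = latents.transpose(1, 2).reshape(B, D, *grid)  # tokens -> 3D patch grid
    return F.interpolate(vol, size=out, mode="trilinear", align_corners=False)
```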
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective via forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
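Read loosely, the bidirectional objective pairs a forward image-to-text mapping trained as classification with a reverse mapping trained as regression back to the starting features. The sketch below is one plausible reading of FMC/RMR with placeholder mapping networks `f_fwd` and `f_rev`; it is not Med-ST's actual formulation.

```python
import torch
import torch.nn.functional as F

def cycle_losses(f_fwd, f_rev, img_feats: torch.Tensor, text_feats: torch.Tensor):
    """img_feats, text_feats: (B, D) paired visual and report features."""
    mapped = f_fwd(img_feats)                    # image features -> text space
    logits = mapped @ text_feats.T               # (B, B) matching scores
    targets = torch.arange(len(logits), device=logits.device)
    fmc = F.cross_entropy(logits, targets)       # forward mapping as classification
    rmr = F.mse_loss(f_rev(mapped), img_feats)   # reverse mapping as regression
    return fmc + rmr                             # bidirectional cycle objective
```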
- Building Universal Foundation Models for Medical Image Analysis with Spatially Adaptive Networks [5.661631789478932]
We propose a universal foundation model for medical image analysis that processes images with heterogeneous spatial properties using a unified structure.
We pre-train a spatial adaptive visual tokenizer (SPAD-VT) and then a spatial adaptive Vision Transformer (SPAD-ViT) via masked image modeling (MIM) on 55 public medical image datasets.
The experimental results on downstream medical image classification and segmentation tasks demonstrate the superior performance and label efficiency of our model.
arXiv Detail & Related papers (2023-12-12T08:33:45Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
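The "disruption" combines local masking with low-level perturbations before reconstruction. A toy corruption step in that spirit, where the patch size, mask ratio, and the choice of additive Gaussian noise as the perturbation are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def disrupt(x: torch.Tensor, patch: int = 16, mask_ratio: float = 0.6,
            noise_std: float = 0.1) -> torch.Tensor:
    """x: (B, C, D, H, W) 3D image. Returns a corrupted copy to reconstruct from."""
    B, C, D, H, W = x.shape
    grid = (D // patch, H // patch, W // patch)
    keep = (torch.rand(B, 1, *grid, device=x.device) > mask_ratio).float()
    mask = F.interpolate(keep, size=(D, H, W), mode="nearest")  # patch-wise mask
    return x * mask + noise_std * torch.randn_like(x)  # local masking + perturbation
```

An autoencoder trained to recover the original volume from such corruptions must attend to low-level structure as well as semantics.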
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.