3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
- URL: http://arxiv.org/abs/2602.23652v2
- Date: Tue, 03 Mar 2026 05:58:36 GMT
- Title: 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
- Authors: Haowen Zhu, Ning Yin, Xiaogen Zhou,
- Abstract summary: We propose MedMAP, a framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. Our experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection.
- Score: 0.31351527202068447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.
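The abstract describes a two-stage recipe (modality-aware vision-language alignment, then fine-tuning with the text encoder frozen) without implementation details. The following is a minimal PyTorch sketch under assumptions: the modality-aware encoder is approximated by adding a learned modality embedding to the visual feature, alignment uses a standard CLIP-style symmetric InfoNCE loss, and the toy backbone, module names, and dimensions are all hypothetical rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareVisionEncoder(nn.Module):
    """Toy 3D encoder whose output is shifted by a learned per-modality embedding."""
    def __init__(self, num_modalities: int = 12, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(                     # stand-in for a real 3D backbone
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.modality_embed = nn.Embedding(num_modalities, embed_dim)

    def forward(self, volume, modality_id):
        return self.backbone(volume) + self.modality_embed(modality_id)

def clip_alignment_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss over matched volume-report pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

vision_encoder = ModalityAwareVisionEncoder()
volumes = torch.randn(4, 1, 32, 64, 64)                    # toy batch of 3D MRI volumes
modalities = torch.randint(0, 12, (4,))                    # one of twelve MRI modalities

# Pre-training step: align volume embeddings with report embeddings (placeholder text features).
report_emb = torch.randn(4, 256)                           # would come from a text encoder
loss = clip_alignment_loss(vision_encoder(volumes, modalities), report_emb)

# Fine-tuning step: keep the (placeholder) text encoder frozen, train a small detection head.
text_encoder = nn.Linear(768, 256)                         # placeholder for a real text encoder
for p in text_encoder.parameters():
    p.requires_grad = False                                # text encoder stays frozen
detector_head = nn.Linear(256, 9)                          # nine abnormality classes (per abstract)
logits = detector_head(vision_encoder(volumes, modalities))
```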
Related papers
- Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models [5.020980730631682]
Existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. We propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module.
arXiv Detail & Related papers (2025-09-11T00:12:59Z)
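The summary names a plane-slice-aware transformer bridging a 3D image encoder with a 2D MLLM but gives no specifics. A minimal sketch, assuming the bridge is a cross-attention block over per-slice tokens with learned plane and slice-index embeddings (names and shapes are assumptions, not the paper's module):

```python
import torch
import torch.nn as nn

class PlaneSliceAwareBridge(nn.Module):
    def __init__(self, dim: int = 256, num_planes: int = 3, max_slices: int = 64):
        super().__init__()
        self.plane_embed = nn.Embedding(num_planes, dim)    # axial / coronal / sagittal
        self.slice_embed = nn.Embedding(max_slices, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vol_slice_tokens, mllm_slice_feats, plane_id):
        # vol_slice_tokens: (B, S, dim) tokens pooled from the 3D encoder, one per slice
        # mllm_slice_feats: (B, S, dim) features from the 2D MLLM for the same slices
        B, S, _ = vol_slice_tokens.shape
        pos = self.slice_embed(torch.arange(S, device=vol_slice_tokens.device))
        q = vol_slice_tokens + pos + self.plane_embed(plane_id).unsqueeze(1)
        fused, _ = self.cross_attn(query=q, key=mllm_slice_feats, value=mllm_slice_feats)
        return fused                                        # (B, S, dim)

bridge = PlaneSliceAwareBridge()
out = bridge(torch.randn(2, 32, 256), torch.randn(2, 32, 256), torch.tensor([0, 2]))
```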
- Does DINOv3 Set a New Medical Vision Standard? [67.33543059306938]
This report investigates whether DINOv3 can serve as a powerful unified encoder for medical vision tasks without domain-specific pre-training. We benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks.
arXiv Detail & Related papers (2025-09-08T09:28:57Z)
- M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision [24.846428105192405]
We train M3Ret, a unified visual encoder, without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP.
arXiv Detail & Related papers (2025-09-01T10:59:39Z)
- OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding [35.35197484810533]
We present OmniV-Med, a unified framework for multimodal medical understanding. We devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture. We introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data and medical videos.
arXiv Detail & Related papers (2025-04-20T17:53:56Z)
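The summary mentions a medical-aware token pruning mechanism that exploits spatial-temporal redundancy, but the exact criterion is not given. The sketch below assumes a simple rule: drop a visual token when it is nearly identical to the token at the same spatial position in the previous slice or frame.

```python
import torch

def prune_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.95):
    """tokens: (T, N, D) visual tokens for T slices/frames, N tokens each.

    Returns a boolean keep-mask of shape (T, N); the first slice is always kept.
    """
    sim = torch.nn.functional.cosine_similarity(tokens[1:], tokens[:-1], dim=-1)  # (T-1, N)
    keep = torch.ones(tokens.shape[:2], dtype=torch.bool)
    keep[1:] = sim < threshold           # drop tokens that barely change between slices
    return keep

tokens = torch.randn(8, 196, 256)        # e.g. 8 slices, 14x14 patch tokens, dim 256
mask = prune_redundant_tokens(tokens)
kept = tokens[mask]                       # flattened set of retained tokens
print(f"kept {mask.float().mean():.1%} of tokens")
```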
- Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation [40.73779035606757]
We introduce MS-VLM, which mimics radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. MS-VLM is capable of obtaining useful volumetric representations from 3D medical images with any slice length and from multiple images acquired from different planes and phases.
arXiv Detail & Related papers (2024-12-18T07:19:48Z)
- Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model [17.69323209661274]
We propose Med-2E3, a 3D medical MLLM that integrates a dual 3D-2D encoder architecture. To aggregate 2D features effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module. Experiments on large-scale, open-source 3D medical multimodal datasets demonstrate that TG-IS exhibits task-specific attention distribution.
arXiv Detail & Related papers (2024-11-19T09:59:59Z)
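The TG-IS scoring module is only named in the summary. A minimal sketch, assuming each 2D slice feature is scored by similarity to the text (instruction) embedding and slices are aggregated by their softmax weights; the formulation and shapes are assumptions, not the paper's definition:

```python
import torch
import torch.nn.functional as F

def tg_is_aggregate(slice_feats: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.1):
    """slice_feats: (B, S, D) per-slice 2D features; text_emb: (B, D) instruction embedding."""
    slice_feats_n = F.normalize(slice_feats, dim=-1)
    text_emb_n = F.normalize(text_emb, dim=-1)
    scores = torch.einsum("bsd,bd->bs", slice_feats_n, text_emb_n) / temperature
    weights = scores.softmax(dim=-1)                           # task-specific slice attention
    pooled = torch.einsum("bs,bsd->bd", weights, slice_feats)  # weighted 2D summary
    return pooled, weights

pooled, w = tg_is_aggregate(torch.randn(2, 48, 512), torch.randn(2, 512))
```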
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework. Our approach orders various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
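A minimal sketch of the autoregressive pre-training idea, assuming the 3D images have already been tokenized and ordered into a sequence; the causal decoder and vocabulary size are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ARVisualTokenModel(nn.Module):
    def __init__(self, vocab_size: int = 8192, dim: int = 256, depth: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier visual tokens.
        L = token_ids.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(token_ids.device)
        h = self.decoder(self.embed(token_ids), mask=mask)
        return self.head(h)

model = ARVisualTokenModel()
ids = torch.randint(0, 8192, (2, 128))                # toy visual-token sequences
logits = model(ids[:, :-1])                           # predict the next token
loss = nn.functional.cross_entropy(logits.reshape(-1, 8192), ids[:, 1:].reshape(-1))
```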
- MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce a novel semantic alignment method of multi-subject fMRI signals using so-called MindFormer.
This model is specifically designed to generate fMRI-conditioned feature vectors that can be used for conditioning a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z)
- Brain3D: Generating 3D Objects from fMRI [78.46936519561298]
We design a novel 3D object representation learning method, Brain3D, that takes as input the fMRI data of a subject. We show that our model captures the distinct functionalities of each region of the human visual system. Preliminary evaluations indicate that Brain3D can successfully identify the disordered brain regions in simulated scenarios.
arXiv Detail & Related papers (2024-05-24T06:06:11Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
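The zero-shot behaviour described above can be illustrated with a small, hedged sketch: an organ-level image embedding is matched against encoded text prompts by cosine similarity, and the best-matching description wins. The encoders, prompts, and dimensions here are placeholders rather than CT-GLIP's actual components.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(organ_emb: torch.Tensor, prompt_embs: torch.Tensor, prompts):
    """organ_emb: (D,) image embedding of an organ crop; prompt_embs: (K, D) text embeddings."""
    sims = F.cosine_similarity(organ_emb.unsqueeze(0), prompt_embs, dim=-1)
    return prompts[int(sims.argmax())], sims

prompts = ["normal liver", "liver tumor", "liver cyst"]          # hypothetical prompts
organ_emb = torch.randn(512)                                     # from an image encoder
prompt_embs = torch.randn(len(prompts), 512)                     # from a text encoder
label, sims = zero_shot_classify(organ_emb, prompt_embs, prompts)
```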
- M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z)
- Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models [1.64647940449869]
Existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume.
We propose Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a Plane-Slice-Aware Transformer (PSAT) module.
arXiv Detail & Related papers (2024-03-08T08:15:53Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.