MedDINOv3: How to adapt vision foundation models for medical image segmentation?
- URL: http://arxiv.org/abs/2509.02379v2
- Date: Wed, 03 Sep 2025 03:08:26 GMT
- Title: MedDINOv3: How to adapt vision foundation models for medical image segmentation?
- Authors: Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang,
- Abstract summary: We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation.<n>We perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe.<n>MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks.
- Score: 16.256590269050367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
Related papers
- Does DINOv3 Set a New Medical Vision Standard? [67.33543059306938]
This report investigates whether DINOv3 can serve as a powerful unified encoder for medical vision tasks without domain-specific pre-training.<n>We benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation.<n>Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks.
arXiv Detail & Related papers (2025-09-08T09:28:57Z) - M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision [24.846428105192405]
We train M3Ret, a unified visual encoder, without any modality-specific customization.<n>It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms.<n>Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP.
arXiv Detail & Related papers (2025-09-01T10:59:39Z) - MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis [10.082738539201804]
Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to domain shifts.<n>We introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis.<n>MedBridge achieved over 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis.
arXiv Detail & Related papers (2025-05-27T19:37:51Z) - ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation [49.42525661521625]
This paper presents ShapeMamba-EM, a specialized fine-tuning method for 3D EM segmentation.
It is tested over a wide range of EM images, covering five segmentation tasks and 10 datasets.
arXiv Detail & Related papers (2024-08-26T08:59:22Z) - Promise:Prompt-driven 3D Medical Image Segmentation Using Pretrained
Image Foundation Models [13.08275555017179]
We propose ProMISe, a prompt-driven 3D medical image segmentation model using only a single point prompt.
We evaluate our model on two public datasets for colon and pancreas tumor segmentations.
arXiv Detail & Related papers (2023-10-30T16:49:03Z) - MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image
Segmentation [58.53672866662472]
We introduce a modality-agnostic SAM adaptation framework, named as MA-SAM.
Our method roots in the parameter-efficient fine-tuning strategy to update only a small portion of weight increments.
By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data.
arXiv Detail & Related papers (2023-09-16T02:41:53Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical
Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - 3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation [52.699139151447945]
We propose a novel adaptation method for transferring the segment anything model (SAM) from 2D to 3D for promptable medical image segmentation.
Our model can outperform domain state-of-the-art medical image segmentation models on 3 out of 4 tasks, specifically by 8.25%, 29.87%, and 10.11% for kidney tumor, pancreas tumor, colon cancer segmentation, and achieve similar performance for liver tumor segmentation.
arXiv Detail & Related papers (2023-06-23T12:09:52Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.