Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
- URL: http://arxiv.org/abs/2508.20909v1
- Date: Thu, 28 Aug 2025 15:38:50 GMT
- Title: Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
- Authors: Yifan Gao, Haoyue Li, Feng Yuan, Xiaosong Wang, Xin Gao
- Abstract summary: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. We propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases.
- Score: 14.779873398321564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
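As a rough illustration of the pipeline the abstract describes (frozen backbone, detail-fusing adapter, fidelity-aware projection, U-Net-style decoder), the PyTorch sketch below wires these pieces together. All module designs, names, and dimensions here are illustrative assumptions, not the authors' implementation; the real code is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenViTBackbone(nn.Module):
    """Stand-in for a frozen DINOv3 encoder: maps an image to a patch-feature map."""
    def __init__(self, embed_dim=768, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        for p in self.parameters():          # kept frozen, as in the paper
            p.requires_grad = False

    def forward(self, x):                    # (B, 3, H, W) -> (B, C, H/16, W/16)
        return self.patch_embed(x)

class SpatialAdapter(nn.Module):
    """Fuses low-level spatial detail with the backbone's semantic features."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, embed_dim, 3, stride=8, padding=1))

    def forward(self, image, feats):
        detail = F.interpolate(self.local(image), size=feats.shape[-2:], mode="bilinear")
        return feats + detail

class FidelityAwareProjection(nn.Module):
    """Channel reduction with a light refinement step (illustrative FAPM stand-in)."""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.refine = nn.Conv2d(in_dim, in_dim, 3, padding=1, groups=in_dim)
        self.proj = nn.Conv2d(in_dim, out_dim, 1)

    def forward(self, feats):
        return self.proj(feats + self.refine(feats))

class DinoUNetSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = FrozenViTBackbone()
        self.adapter = SpatialAdapter()
        self.fapm = FidelityAwareProjection()
        self.decoder = nn.Sequential(        # toy U-Net-style upsampling path
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2))

    def forward(self, x):
        feats = self.adapter(x, self.backbone(x))
        return self.decoder(self.fapm(feats))

logits = DinoUNetSketch()(torch.randn(1, 3, 224, 224))
print(logits.shape)                          # torch.Size([1, 2, 224, 224])
```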
Related papers
- DINOv3 [62.31809406012177]
Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures.
This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies.
DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks.
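A common way to inspect such dense patch features is to project them onto a few principal components and view them as a coarse feature map. The sketch below does this with a random stand-in tensor, since loading the actual DINOv3 weights is outside the scope of this summary.

```python
import torch

# Stand-in for dense patch tokens from a DINO-style backbone:
# 14x14 patches with 768-dim features (assumed sizes).
B, N, C = 1, 196, 768
patch_tokens = torch.randn(B, N, C)

# Project patch features onto 3 principal components for a quick RGB-style view.
tokens = patch_tokens.reshape(-1, C)
tokens = tokens - tokens.mean(dim=0, keepdim=True)
_, _, v = torch.pca_lowrank(tokens, q=3)
rgb = (tokens @ v).reshape(14, 14, 3)        # one "color" per patch
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-8)
print(rgb.shape)                             # torch.Size([14, 14, 3])
```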
arXiv Detail & Related papers (2025-08-13T18:00:55Z)
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [52.261584726401686]
We present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model.
Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality.
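The core idea, quantizing features from a frozen vision foundation model into discrete tokens, can be sketched with a plain nearest-neighbour codebook lookup. This is a generic vector-quantization stand-in, not the VFMTok design; the dimensions and codebook size are assumptions.

```python
import torch
import torch.nn as nn

class FeatureQuantizer(nn.Module):
    """Nearest-neighbour vector quantization of (frozen) backbone features
    into discrete token ids."""
    def __init__(self, num_codes=4096, dim=768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats):                              # feats: (B, N, dim)
        dists = torch.cdist(feats, self.codebook.weight[None])  # (B, N, num_codes)
        ids = dists.argmin(dim=-1)                          # discrete token ids
        quantized = self.codebook(ids)                      # (B, N, dim)
        return ids, quantized

frozen_feats = torch.randn(2, 196, 768)     # stand-in for frozen backbone features
ids, q = FeatureQuantizer()(frozen_feats)
print(ids.shape, q.shape)                   # torch.Size([2, 196]) torch.Size([2, 196, 768])
```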
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
- DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks [61.16389024252561]
We develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data.
We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model.
Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models.
arXiv Detail & Related papers (2025-02-24T13:51:06Z)
- Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval [0.37478492878307323]
Content-based medical image retrieval (CBMIR) depends on image features, which can be extracted automatically or semi-automatically.
In this study, we used several pre-trained feature extractors from well-known pre-trained convolutional neural networks (CNNs) and pre-trained foundation models.
Our results show that, overall, for the 2D datasets, foundation models deliver superior performance by a large margin compared to CNNs.
Our findings confirm that while using larger image sizes (especially for 2D datasets) yields slightly better performance, competitive CBMIR performance can still be achieved even with smaller image sizes.
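The retrieval setup being compared is essentially a frozen extractor plus nearest-neighbour search. Below is a minimal sketch using a torchvision ResNet-50 as the stand-in extractor and random tensors in place of a real medical image collection; the backbone choice and preprocessing are not the paper's exact setup.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen feature extractor: pooled 2048-d ResNet-50 features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

database = torch.randn(20, 3, 224, 224)     # stand-in image collection
query = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    db_emb = F.normalize(backbone(database), dim=1)
    q_emb = F.normalize(backbone(query), dim=1)

scores = q_emb @ db_emb.T                   # cosine similarities to the database
topk = scores.topk(5, dim=1).indices        # indices of the 5 most similar images
print(topk)
```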
arXiv Detail & Related papers (2024-09-14T13:07:30Z)
- Flemme: A Flexible and Modular Learning Platform for Medical Images [5.086862917025204]
Flemme is a FLExible and Modular learning platform for MEdical images.
We construct encoders using building blocks based on convolution, transformer, and state-space model (SSM) to process both 2D and 3D image patches.
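The modular-encoder idea can be sketched as a small registry of interchangeable block families. This toy factory is illustrative only and does not mirror Flemme's actual API; its SSM blocks and 3D support are omitted here.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Plain convolutional block stack."""
    def __init__(self, dim=64, depth=3):
        super().__init__()
        layers = [nn.Conv2d(3, dim, 3, padding=1), nn.GELU()]
        for _ in range(depth - 1):
            layers += [nn.Conv2d(dim, dim, 3, padding=1), nn.GELU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class ViTEncoder(nn.Module):
    """Patchify, then run transformer encoder layers."""
    def __init__(self, dim=64, depth=3, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        return self.body(self.embed(x).flatten(2).transpose(1, 2))

ENCODERS = {"conv": ConvEncoder, "transformer": ViTEncoder}

def build_encoder(kind: str) -> nn.Module:
    # An SSM block family would slot into the same registry in a fuller version.
    return ENCODERS[kind]()

x = torch.randn(1, 3, 64, 64)
print(build_encoder("conv")(x).shape)          # torch.Size([1, 64, 64, 64])
print(build_encoder("transformer")(x).shape)   # torch.Size([1, 16, 64])
```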
arXiv Detail & Related papers (2024-08-18T05:47:33Z)
- Few-Shot Medical Image Segmentation with High-Fidelity Prototypes [38.073371773707514]
We propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively.
To construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner.
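A generic version of multi-prototype construction, clustering masked support features into several prototypes and matching the query against them by cosine similarity, can be sketched as follows. This uses plain k-means on random stand-in features and is not DSPNet's refinement scheme.

```python
import torch
import torch.nn.functional as F

def multi_prototypes(feats, mask, k=3, iters=10):
    """Cluster masked support features into k prototypes with plain k-means."""
    fg = feats[:, mask.bool()].T                      # (N_fg, C) foreground vectors
    protos = fg[torch.randperm(len(fg))[:k]]          # random initial centers
    for _ in range(iters):
        assign = torch.cdist(fg, protos).argmin(dim=1)
        protos = torch.stack([fg[assign == j].mean(0) if (assign == j).any()
                              else protos[j] for j in range(k)])
    return protos                                     # (k, C)

C, H, W = 32, 16, 16
support_feats = torch.randn(C, H, W)                  # stand-in encoder features
support_mask = torch.zeros(H, W)
support_mask[4:12, 4:12] = 1                          # toy foreground region
protos = multi_prototypes(support_feats, support_mask)

query_feats = torch.randn(C, H, W)
sims = F.cosine_similarity(query_feats.unsqueeze(0),  # (k, C, H, W) broadcast
                           protos.view(-1, C, 1, 1), dim=1)
pred_fg = sims.max(dim=0).values > 0.0                # toy foreground prediction
print(protos.shape, pred_fg.shape)                    # torch.Size([3, 32]) torch.Size([16, 16])
```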
arXiv Detail & Related papers (2024-06-26T05:06:14Z)
- Self-Prompting Large Vision Models for Few-Shot Medical Image Segmentation [14.135249795318591]
We propose a novel perspective on self-prompting in medical vision applications.
We harness the embedding space of the Segment Anything Model to prompt itself through a simple yet effective linear pixel-wise classifier.
We achieve competitive results on multiple datasets.
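The "linear pixel-wise classifier over the promptable model's embedding space" idea reduces to a 1x1 convolution whose strongest response can be fed back as a point prompt. The sketch below uses a random tensor as a stand-in for the Segment Anything Model's image-encoder output; the 256x64x64 shape is an assumption.

```python
import torch
import torch.nn as nn

embedding = torch.randn(1, 256, 64, 64)                # stand-in image embedding
pixel_classifier = nn.Conv2d(256, 1, kernel_size=1)    # linear per-pixel scorer
coarse_logits = pixel_classifier(embedding)            # (1, 1, 64, 64)

# The highest-scoring location could be returned to the segmenter as a point prompt.
row, col = divmod(int(coarse_logits.flatten().argmax()), 64)
print(coarse_logits.shape, (row, col))
```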
arXiv Detail & Related papers (2023-08-15T08:20:07Z)
- IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
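Turning an image into "a concise sequence of semantic units" amounts, in spirit, to quantizing its embedding into a short id sequence. The residual-quantization sketch below illustrates only that generic step; it is not IRGen's tokenizer, and the codebooks and embedding are random placeholders.

```python
import torch

def residual_quantize(vec, codebooks):
    """Turn one embedding into a short sequence of discrete ids by residual quantization."""
    ids, residual = [], vec.clone()
    for cb in codebooks:                           # each cb: (num_codes, dim)
        idx = torch.cdist(residual[None], cb).argmin()
        ids.append(int(idx))
        residual = residual - cb[idx]
    return ids

dim, levels, num_codes = 128, 4, 256
codebooks = [torch.randn(num_codes, dim) for _ in range(levels)]
image_embedding = torch.randn(dim)                 # stand-in global image feature
print(residual_quantize(image_embedding, codebooks))   # e.g. [17, 203, 5, 98]
```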
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
- UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation [14.873473285148853]
We introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and Convolutional Neural Network (CNN)- and transformer-based decoders.
In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision.
We present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked tokens.
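The masked-token pre-training objective can be sketched in a few lines: hide a random subset of patch tokens, encode the corrupted sequence, and regress the originals at the masked positions. The plain transformer encoder and dimensions below are placeholders for the paper's Swin-based 3D encoder.

```python
import torch
import torch.nn as nn

B, N, D = 2, 64, 128
tokens = torch.randn(B, N, D)                        # patch tokens of an image/volume

mask = torch.rand(B, N) < 0.5                         # mask ~50% of tokens
mask_token = nn.Parameter(torch.zeros(1, 1, D))
corrupted = torch.where(mask[..., None], mask_token.expand(B, N, D), tokens)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(D, D)

pred = head(encoder(corrupted))
loss = ((pred - tokens)[mask] ** 2).mean()            # reconstruct only masked tokens
loss.backward()
print(float(loss))
```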
arXiv Detail & Related papers (2022-04-01T17:38:39Z)
- InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images into the latent space of a high-quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z)
- Hierarchical Amortized Training for Memory-efficient High Resolution 3D GAN [52.851990439671475]
We propose a novel end-to-end GAN architecture that can generate high-resolution 3D images.
We achieve this goal by using different configurations between training and inference.
Experiments on 3D thorax CT and brain MRI demonstrate that our approach outperforms the state of the art in image generation.
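One concrete way to see the "different configurations between training and inference" idea is a fully convolutional 3D generator run on small latent grids during training (cheap sub-volumes) and a larger grid at inference (the full volume). The toy sketch below illustrates only that memory trade-off, not the paper's hierarchical scheme.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1), nn.Tanh())

train_latent = torch.randn(2, 64, 4, 4, 4)        # small grid -> 16^3 sub-volume
infer_latent = torch.randn(1, 64, 16, 16, 16)     # larger grid -> 64^3 full volume
print(generator(train_latent).shape)              # torch.Size([2, 1, 16, 16, 16])
print(generator(infer_latent).shape)              # torch.Size([1, 1, 64, 64, 64])
```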
arXiv Detail & Related papers (2020-08-05T02:33:04Z)
- Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning-based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
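The spatial-warping primitive underlying such registration networks, resampling the moving image with a dense displacement field, can be sketched as follows (2D for brevity; the paper's experiments use 3D volumes, and the displacement values here are toy inputs).

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a moving image by a dense displacement field given in pixels."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()[None]          # (1, H, W, 2) as (x, y)
    new = grid + flow.permute(0, 2, 3, 1)                        # displaced sampling points
    new = 2 * new / torch.tensor([W - 1, H - 1]) - 1             # normalize to [-1, 1]
    return F.grid_sample(moving, new, align_corners=True)

moving = torch.rand(1, 1, 64, 64)                  # stand-in moving image
flow = torch.zeros(1, 2, 64, 64)
flow[:, 0] += 3.0                                  # shift every pixel 3 px along x
print(warp(moving, flow).shape)                    # torch.Size([1, 1, 64, 64])
```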
arXiv Detail & Related papers (2020-04-30T03:23:45Z)