Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation
- URL: http://arxiv.org/abs/2602.07550v1
- Date: Sat, 07 Feb 2026 13:51:53 GMT
- Title: Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation
- Authors: Hussni Mohd Zakir, Eric Tatt Wei Ho,
- Abstract summary: Recent self-supervised Vision Transformers (ViTs) provide rich feature representations for dense vision tasks.<n>This study investigates the few-shot semantic segmentation capabilities of frozen DINOv3 features through a training-free baseline.<n>We conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3.The code is publicly available at https://github.com/hussni0997/fssdino.
Related papers
- MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification [9.952997875404634]
Clustering-based approaches can provide explainable decision-making process but suffer from high dimension features and semantically ambiguous centroids.<n>We propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering.<n> Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines better feature representations and it requires acceptable resources.
arXiv Detail & Related papers (2026-02-16T06:43:36Z) - TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images [10.375329759512702]
TSE-Net is a self-training pipeline for semi-supervised monocular height estimation.<n>The pipeline integrates teacher, student, and exam networks.<n>We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities.
arXiv Detail & Related papers (2025-11-17T16:22:38Z) - Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss [52.28880405119483]
Unsupervised online 3D instance segmentation is a fundamental yet challenging task.<n>Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity.<n>We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation.
arXiv Detail & Related papers (2025-09-27T08:53:27Z) - Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks.<n>Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs.<n>We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z) - Improving Open-Set Semantic Segmentation in 3D Point Clouds by Conditional Channel Capacity Maximization: Preliminary Results [1.1328543389752008]
We propose a plug and play framework for Open-Set Semantic (O3S)<n>By modeling the segmentation pipeline as a conditional Markov chain, we derive a novel regularizer term dubbed Conditional Channel Capacity Maximization (3CM)<n>We show that 3CM encourages the encoder to retain richer, label-dependent features, thereby enhancing the network's ability to distinguish and segment previously unseen categories.
arXiv Detail & Related papers (2025-05-09T04:12:26Z) - Hybrid Multi-Stage Learning Framework for Edge Detection: A Survey [0.0]
This paper introduces a Hybrid Multi-Stage Learning Framework that integrates Convolutional Neural Network (CNN) feature extraction with a Support Vector Machine (SVM)<n>Our approach decouples feature representation and classification stages, enhancing robustness and interpretability.
arXiv Detail & Related papers (2025-03-26T13:06:31Z) - OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation [69.37484603556307]
Un Semantic segmenting (USS) involves segmenting images without relying on predefined labels.
We introduce a novel approach called Optimally Matched Hierarchy (OMH) to simultaneously address the above issues.
Our OMH yields better unsupervised segmentation performance compared to existing USS methods.
arXiv Detail & Related papers (2024-03-11T09:46:41Z) - Class-Imbalanced Semi-Supervised Learning for Large-Scale Point Cloud
Semantic Segmentation via Decoupling Optimization [64.36097398869774]
Semi-supervised learning (SSL) has been an active research topic for large-scale 3D scene understanding.
The existing SSL-based methods suffer from severe training bias due to class imbalance and long-tail distributions of the point cloud data.
We introduce a new decoupling optimization framework, which disentangles feature representation learning and classifier in an alternative optimization manner to shift the bias decision boundary effectively.
arXiv Detail & Related papers (2024-01-13T04:16:40Z) - Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z) - Multi-scale and Cross-scale Contrastive Learning for Semantic
Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z) - Self-Ensembling GAN for Cross-Domain Semantic Segmentation [107.27377745720243]
This paper proposes a self-ensembling generative adversarial network (SE-GAN) exploiting cross-domain data for semantic segmentation.
In SE-GAN, a teacher network and a student network constitute a self-ensembling model for generating semantic segmentation maps, which together with a discriminator, forms a GAN.
Despite its simplicity, we find SE-GAN can significantly boost the performance of adversarial training and enhance the stability of the model.
arXiv Detail & Related papers (2021-12-15T09:50:25Z) - Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results.
Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples.
Theses frameworks still face the challenge of generalization ability reduction on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.