ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
- URL: http://arxiv.org/abs/2506.08678v1
- Date: Tue, 10 Jun 2025 10:40:10 GMT
- Title: ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
- Authors: Juan Yeo, Soonwoo Cha, Jiwoo Song, Hyunbin Jin, Taesup Kim
- Abstract summary: Any-to-Any Self-Distillation (ATAS) is a novel approach that simultaneously enhances semantic coherence and fine-grained alignment. ATAS achieves substantial performance gains on open-vocabulary object detection and semantic segmentation benchmarks.
- Score: 3.7365850182404845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and they typically rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine the representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.
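The abstract describes the mechanism only at a high level, so the snippet below is a minimal, hypothetical sketch of what one any-to-any self-distillation step might look like for a ViT-style CLIP vision encoder. The encoder stub (`ToyVisionEncoder`), the pairwise cosine losses across global and patch-level representations, and the EMA teacher update are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an "any-to-any" self-distillation step in the spirit of ATAS.
# The abstract only states that a CLIP vision encoder is refined with its own
# knowledge across representation levels on unlabeled images; the encoder stub,
# loss form, and EMA momentum below are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionEncoder(nn.Module):
    """Stand-in for a CLIP ViT: returns a global feature and patch features."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_embeddings):
        # Assume inputs are already tokenized patch embeddings of shape [B, N, D].
        tokens = self.proj(patch_embeddings)
        global_feat = tokens.mean(dim=1)   # coarse, image-level representation
        patch_feats = tokens               # fine, region-level representations
        return global_feat, patch_feats

def any_to_any_loss(student_out, teacher_out):
    """Cosine consistency between pairs of representation levels (assumed form)."""
    s_global, s_patch = student_out
    t_global, t_patch = teacher_out
    # Align student patches to the teacher's global view and vice versa,
    # plus same-level consistency; all targets come from the frozen teacher.
    losses = [
        1 - F.cosine_similarity(s_patch, t_global.unsqueeze(1).detach(), dim=-1).mean(),
        1 - F.cosine_similarity(s_global, t_patch.mean(1).detach(), dim=-1).mean(),
        1 - F.cosine_similarity(s_patch, t_patch.detach(), dim=-1).mean(),
    ]
    return sum(losses) / len(losses)

student = ToyVisionEncoder()
teacher = copy.deepcopy(student).eval()    # self-distillation teacher (frozen per step)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

unlabeled_batch = torch.randn(8, 196, 512) # placeholder for unlabeled image tokens
loss = any_to_any_loss(student(unlabeled_batch), teacher(unlabeled_batch))
opt.zero_grad()
loss.backward()
opt.step()

# EMA update of the teacher from the student (momentum value is an assumption).
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.999).add_(ps, alpha=0.001)
```

The point the sketch illustrates is that every target comes from the model itself (a frozen copy of the encoder), so only unlabeled images are needed and no extra modules or external supervision are introduced, consistent with the abstract's claims.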
Related papers
- Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score [11.74414842618874]
We show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels.
We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings.
Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78%.
arXiv Detail & Related papers (2025-07-13T12:38:38Z) - Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach [43.419607730361996]
Vision-Language Models (VLMs) like CLIP achieve cross-modal alignment through contrastive learning.
Traditional prompt engineering relies on coarse-grained category labels, neglecting fine-grained local semantics.
We propose a plug-and-play solution that enables CLIP to process localized visual descriptors.
arXiv Detail & Related papers (2025-07-04T10:24:26Z) - Refining CLIP's Spatial Awareness: A Visual-Centric Perspective [10.936397225984107]
Contrastive Language-Image Pre-training excels in global alignment with language but exhibits limited sensitivity to spatial information.
Recent approaches have introduced Region-Language Alignment to enhance CLIP's performance in dense multimodal tasks.
We propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the resulting degradation.
arXiv Detail & Related papers (2025-04-03T07:04:56Z) - Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios.
Existing SHGL methods encounter two significant limitations.
We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z) - ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z) - Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z) - Open-Vocabulary Segmentation with Semantic-Assisted Calibration [68.41025728960176]
We study open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z) - Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations [9.444540281544715]
We introduce a novel agent for active open-vocabulary recognition.
The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features, without relying on class-specific knowledge.
arXiv Detail & Related papers (2023-11-28T19:24:07Z) - Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z) - Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches.
We propose an Adaptive Re-Activation Mechanism (AReAM) to rein in undisciplined over-smoothing in deep-level attention.
AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z) - Deep Clustering by Semantic Contrastive Learning [67.28140787010447]
We introduce a novel variant called Semantic Contrastive Learning (SCL).
It explores the characteristics of both conventional contrastive learning and deep clustering.
It can amplify the strengths of contrastive learning and deep clustering in a unified approach.
arXiv Detail & Related papers (2021-03-03T20:20:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.