Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2510.23894v1
- Date: Mon, 27 Oct 2025 22:05:08 GMT
- Title: Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
- Authors: Jinxin Zhou, Jiachen Jiang, Zhihui Zhu
- Abstract summary: LHT-CLIP is a training-free framework that exploits the visual discriminability of CLIP across layer, head, and token levels. It achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
- Score: 20.30263242388691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the cost of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) displays consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement, which effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
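As a rough illustration of the training-free dense-prediction setting the abstract refers to, the sketch below labels each image patch with the class whose text embedding is most similar in cosine terms. It is a generic baseline under stated assumptions, not the LHT-CLIP method: the function name `segment_patches`, the toy 3-dimensional features, and the 2x2 patch grid are all hypothetical, and real features would come from CLIP's vision and text encoders.

```python
import numpy as np


def segment_patches(patch_feats, text_feats):
    """Label each patch with the index of its most similar text embedding.

    patch_feats: array of shape (H, W, D), one feature vector per patch.
    text_feats:  array of shape (C, D), one embedding per class name.
    Both shapes and the function itself are illustrative assumptions.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sim = p @ t.T               # (H, W, C) patch-to-class similarities
    return sim.argmax(axis=-1)  # (H, W) predicted class index per patch


# Toy stand-ins for encoder outputs (hypothetical values, 2 classes, D = 3).
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
patch_feats = np.array([[[2.0, 0.1, 0.0], [0.1, 3.0, 0.0]],
                        [[0.5, 0.4, 0.0], [0.0, 0.2, 1.0]]])

labels = segment_patches(patch_feats, text_feats)
print(labels)
```

The bottom-right patch in this toy grid points mostly along a direction that neither class covers, yet the argmax still assigns it a label. Low-discriminability features of this kind are what the abstract's three techniques aim to repair.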
Related papers
- Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition [55.189113121465816]
We propose a novel correlation adaptation prompt network (CAPNET) for long-tailed multi-label visual recognition. CAPNET explicitly models correlations from CLIP's textual encoder. It improves generalization through test-time ensembling and realigns visual-textual modalities.
arXiv Detail & Related papers (2025-11-25T18:57:28Z)
- AttriPrompt: Dynamic Prompt Composition Learning for CLIP [41.37140060183439]
AttriPrompt is a novel framework that enhances and refines textual semantic representations. We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features. Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
arXiv Detail & Related papers (2025-09-07T07:07:59Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios. Existing SHGL methods encounter two significant limitations. We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z)
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- 2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic Segmentation [92.17700318483745]
We propose an image-guidance network (IGNet) which builds upon the idea of distilling high level feature information from a domain adapted synthetically trained 2D semantic segmentation network.
IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points.
arXiv Detail & Related papers (2023-11-27T07:57:29Z)
- Instance Adaptive Prototypical Contrastive Embedding for Generalized Zero Shot Learning [11.720039414872296]
Generalized zero-shot learning aims to classify samples from seen and unseen labels, assuming unseen labels are not accessible during training.
Recent advancements in GZSL have been expedited by incorporating contrastive-learning-based embedding in generative networks.
arXiv Detail & Related papers (2023-09-13T14:26:03Z)
- EnTri: Ensemble Learning with Tri-level Representations for Explainable Scene Recognition [27.199124692225777]
Scene recognition based on deep learning has made significant progress, but there are still limitations in its performance.
We propose EnTri, a framework that employs ensemble learning using a hierarchy of visual features.
EnTri has demonstrated superiority in terms of recognition accuracy, achieving competitive performance compared to state-of-the-art approaches.
arXiv Detail & Related papers (2023-07-23T22:11:23Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.