Related papers: Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation

Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation

URL: http://arxiv.org/abs/2404.08181v2
Date: Mon, 16 Sep 2024 20:11:48 GMT
Title: Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation
Authors: Sina Hajimiri, Ismail Ben Ayed, Jose Dolz,
Abstract summary: vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks. We propose a baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP) Our method enforces localization of patches in the self-attention of CLIP's vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature.
Score: 19.20874993309959
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the significant progress in deep learning for dense visual recognition problems, such as semantic segmentation, traditional methods are constrained by fixed class sets. Meanwhile, vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks, owing to their robust generalizability. Recently, a body of work has investigated utilizing these models in open-vocabulary semantic segmentation (OVSS). However, existing approaches often rely on impractical supervised pre-training or access to additional pre-trained networks. In this work, we propose a strong baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of CLIP tailored for this scenario. Our method enforces localization of patches in the self-attention of CLIP's vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature. By incorporating design choices favouring segmentation, our approach significantly improves performance without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning, making it highly practical for real-world applications. Experiments are performed on 8 popular semantic segmentation benchmarks, yielding state-of-the-art performance on most scenarios. Our code is publicly available at https://github.com/sinahmr/NACLIP.

Related papers

Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation [22.409969687852506]
We propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image.<n>Our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2025-12-08T10:00:36Z)
Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training [11.897717409259492]
Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models.<n>We propose GLARE, a novel continual self-supervised pre-training task designed to enhance downstream segmentation performance.
arXiv Detail & Related papers (2025-09-22T14:11:02Z)
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain content'' and context'' features respectively.<n>It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
Training-Free Class Purification for Open-Vocabulary Semantic Segmentation [72.87707878910896]
FreeCP is a training-free class purification framework for semantic segmentation.<n>We conduct experiments across eight benchmarks to validate FreeCP's effectiveness.<n>Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.
arXiv Detail & Related papers (2025-08-01T11:55:12Z)
Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation [5.296260279593993]
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. We propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment.
arXiv Detail & Related papers (2025-03-11T21:38:34Z)
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties. We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit the CLIP and CoOp with our method to effectively improve the model on few-shot image classficiation scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation [38.16802763051431]
We propose CLIPtrase, a training-free semantic segmentation strategy. It enhances local feature awareness through recalibrated self-correlation among patches. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks.
arXiv Detail & Related papers (2024-07-11T08:12:16Z)
Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes. Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts. We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
Test-Time Training for Semantic Segmentation with Output Contrastive Loss [12.535720010867538]
Deep learning-based segmentation models have achieved impressive performance on public benchmarks, but generalizing well to unseen environments remains a major challenge. This paper introduces Contrastive Loss (OCL), known for its capability to learn robust and generalized representations, to stabilize the adaptation process. Our method excels even when applied to models initially pre-trained using domain adaptation methods on test domain data, showcasing its resilience and adaptability.
arXiv Detail & Related papers (2023-11-14T03:13:47Z)
Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks. We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category. We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.