Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation
- URL: http://arxiv.org/abs/2403.08426v1
- Date: Wed, 13 Mar 2024 11:23:55 GMT
- Title: Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation
- Authors: Zicheng Zhang, Tong Zhang, Yi Zhu, Jianzhuang Liu, Xiaodan Liang, Qixiang Ye, Wei Ke
- Abstract summary: We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the generalization capacity of segmentation models to unseen classes.
- Score: 114.72734384299476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pre-trained vision-language model, exemplified by CLIP, advances
zero-shot semantic segmentation by aligning visual features with class
embeddings through a transformer decoder to generate semantic masks. Despite
its effectiveness, prevailing methods within this paradigm encounter
challenges, including overfitting on seen classes and small fragmentation in
masks. To mitigate these issues, we propose a Language-Driven Visual Consensus
(LDVC) approach, fostering improved alignment of semantic and visual
information. Specifically, we leverage class embeddings as anchors due to their
discrete and abstract nature, steering vision features toward class embeddings.
Moreover, to circumvent noisy alignments from the vision part due to its
redundant nature, we introduce route attention into self-attention for finding
visual consensus, thereby enhancing semantic consistency within the same
object. Equipped with a vision-language prompting strategy, our approach
significantly boosts the generalization capacity of segmentation models for
unseen classes. Experimental results underscore the effectiveness of our
approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the
COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.
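As a reading aid, the core mechanism described in the abstract (class embeddings acting as anchors that steer vision features, plus a sparsified self-attention that seeks visual consensus) can be pictured with the minimal PyTorch sketch below. The module name, dimensions, and the top-k routing rule are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' code): vision tokens are updated by
# cross-attending to class embeddings, pulling each token toward the
# language anchors; a top-k-sparsified self-attention stands in for the
# paper's "route attention" (the actual routing rule may differ).
import torch
import torch.nn as nn


class LanguageAnchoredBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, topk: int = 16):
        super().__init__()
        self.topk = topk
        # Language-to-vision cross-attention: vision tokens query class embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vision: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
        # vision:    [B, N, D]  flattened patch/pixel tokens
        # class_emb: [B, C, D]  text (class-name) embeddings used as anchors
        # 1) Steer vision features toward the class anchors.
        anchored, _ = self.cross_attn(query=self.norm1(vision),
                                      key=class_emb, value=class_emb)
        vision = vision + anchored

        # 2) Sparse ("routed") self-attention: each token aggregates only its
        #    top-k most similar tokens, encouraging consensus within an object.
        x = self.norm2(vision)
        sim = x @ x.transpose(1, 2) / x.shape[-1] ** 0.5          # [B, N, N]
        k = min(self.topk, sim.shape[-1])
        thresh = sim.topk(k, dim=-1).values[..., -1:]             # k-th largest per row
        sim = sim.masked_fill(sim < thresh, float('-inf'))
        vision = vision + self.proj(sim.softmax(dim=-1) @ x)
        return vision


if __name__ == "__main__":
    block = LanguageAnchoredBlock(dim=512)
    feats = torch.randn(2, 196, 512)      # e.g. a 14x14 patch grid
    classes = torch.randn(2, 20, 512)     # e.g. 20 class-name embeddings
    print(block(feats, classes).shape)    # torch.Size([2, 196, 512])
```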
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding [39.424931953675994]
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data.
This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks.
arXiv Detail & Related papers (2023-08-22T13:55:57Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages an existing pretrained vision-language (VL) model to train semantic segmentation models.
ZeroSeg does so without human labels by distilling the visual concepts learned by the VL model into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Delving into Shape-aware Zero-shot Semantic Segmentation [18.51025849474123]
We present shape-aware zero-shot semantic segmentation.
Inspired by classical spectral methods, we propose to leverage the eigenvectors of Laplacian matrices constructed with self-supervised pixel-wise features (a rough sketch of this spectral step appears after this list).
Our method sets new state-of-the-art performance for zero-shot semantic segmentation on both Pascal and COCO.
arXiv Detail & Related papers (2023-04-17T17:59:46Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Cluster-based Contrastive Disentangling for Generalized Zero-Shot Learning [25.92340532509084]
Generalized Zero-Shot Learning (GZSL) aims to recognize both seen and unseen classes by training only on the seen classes.
We propose a Cluster-based Contrastive Disentangling (CCD) method to improve GZSL by alleviating the semantic gap and domain shift problems.
arXiv Detail & Related papers (2022-03-05T02:50:12Z)
- Zero-Shot Semantic Segmentation via Spatial and Multi-Scale Aware Visual Class Embedding [0.0]
We propose a language-model-free zero-shot semantic segmentation framework, the Spatial and Multi-scale aware Visual Class Embedding Network (SM-VCENet).
In experiments, SM-VCENet outperforms the zero-shot semantic segmentation state of the art by a relative margin.
arXiv Detail & Related papers (2021-11-30T07:39:19Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
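For the shape-aware entry above, the spectral step it mentions (eigenvectors of Laplacian matrices built from self-supervised pixel-wise features) can be sketched roughly as follows. The feature source (e.g. DINO), the cosine affinity, and the number of eigenvectors kept are assumptions for illustration, not that paper's exact pipeline.

```python
# Rough sketch of the spectral step described in the shape-aware entry above:
# build an affinity graph over pixel-wise features, form its normalized
# Laplacian, and read low-frequency eigenvectors as soft, shape-aware
# segment indicators. Feature source and constants are illustrative.
import numpy as np


def spectral_segments(features: np.ndarray, num_eigvecs: int = 4) -> np.ndarray:
    """features: [N, D] self-supervised pixel/patch features (e.g. from DINO)."""
    # Cosine affinity, clipped at zero so graph weights are non-negative.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    affinity = np.clip(f @ f.T, 0.0, None)                      # [N, N]

    # Normalized graph Laplacian  L = I - D^{-1/2} W D^{-1/2}.
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg + 1e-8)
    laplacian = np.eye(len(f)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # The smallest non-trivial eigenvectors vary smoothly within an object,
    # which is what lends the resulting segmentation its shape awareness.
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, 1:1 + num_eigvecs]                        # skip the constant one


if __name__ == "__main__":
    feats = np.random.randn(196, 384)      # e.g. a 14x14 grid of DINO features
    print(spectral_segments(feats).shape)  # (196, 4)
```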
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.