Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation
- URL: http://arxiv.org/abs/2310.19001v1
- Date: Sun, 29 Oct 2023 13:18:00 GMT
- Title: Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation
- Authors: Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao
Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
- Abstract summary: This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance segmentation network (PGSeg), which incorporates multi-modal regularization.
- Score: 59.37587762543934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the problem of weakly open-vocabulary semantic
segmentation (WOVSS), which learns to segment objects of arbitrary classes
using mere image-text pairs. Existing works turn to enhance the vanilla vision
transformer by introducing explicit grouping recognition, i.e., employing
several group tokens/centroids to cluster the image tokens and perform the
group-text alignment. Nevertheless, these methods suffer from a granularity
inconsistency regarding the usage of group tokens, which are aligned in the
all-to-one v.s. one-to-one manners during the training and inference phases,
respectively. We argue that this discrepancy arises from the lack of elaborate
supervision for each group token. To bridge this granularity gap, this paper
explores explicit supervision for the group tokens from the prototypical
knowledge. To this end, this paper proposes the non-learnable prototypical
regularization (NPR) where non-learnable prototypes are estimated from source
features to serve as supervision and enable contrastive matching of the group
tokens. This regularization encourages the group tokens to segment objects with
less redundancy and capture more comprehensive semantic regions, leading to
increased compactness and richness. Based on NPR, we propose the prototypical
guidance segmentation network (PGSeg) that incorporates multi-modal
regularization by leveraging prototypical sources from both images and texts at
different levels, progressively enhancing the segmentation capability with
diverse prototypical patterns. Experimental results show that our proposed
method achieves state-of-the-art performance on several benchmark datasets. The
source code is available at https://github.com/Ferenas/PGSeg.
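
For readers wanting a concrete picture of the core abstraction, the snippet below is a minimal, hypothetical PyTorch sketch of the NPR idea described above: non-learnable prototypes estimated from source features (here with a simple k-means-style procedure, an assumption) act as fixed supervision, and the group tokens are contrastively matched against them. It is not the official PGSeg code; the linked repository contains the authors' implementation, and details such as the prototype estimation and the matching scheme may differ. All names (`estimate_prototypes`, `npr_loss`, `num_prototypes`, `temperature`) are illustrative.

```python
# Minimal sketch of non-learnable prototypical regularization (NPR);
# NOT the official PGSeg implementation. Prototypes are estimated from
# source features without gradients and then used as fixed targets for
# a contrastive matching loss on the group tokens.
import torch
import torch.nn.functional as F


@torch.no_grad()
def estimate_prototypes(source_feats: torch.Tensor,
                        num_prototypes: int,
                        iters: int = 10) -> torch.Tensor:
    """Estimate non-learnable prototypes from (N, D) source features
    (e.g. image patch features or text embeddings); returns (K, D)."""
    feats = F.normalize(source_feats, dim=-1)
    # Initialize prototypes from randomly chosen source features.
    idx = torch.randperm(feats.size(0))[:num_prototypes]
    protos = feats[idx].clone()
    for _ in range(iters):
        # Assign each feature to its nearest prototype by cosine similarity.
        assign = (feats @ protos.t()).argmax(dim=-1)            # (N,)
        for k in range(num_prototypes):
            members = feats[assign == k]
            if members.numel() > 0:
                protos[k] = F.normalize(members.mean(dim=0), dim=-1)
    return protos


def npr_loss(group_tokens: torch.Tensor,
             prototypes: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """Contrastively match (G, D) group tokens against (K, D) fixed
    prototypes: each token is pulled toward its best-matching prototype
    and pushed away from the rest."""
    tokens = F.normalize(group_tokens, dim=-1)
    logits = tokens @ prototypes.t() / temperature              # (G, K)
    targets = logits.argmax(dim=-1).detach()                    # hard assignment
    return F.cross_entropy(logits, targets)


# Illustrative usage with random tensors (all shapes are assumptions).
patch_feats = torch.randn(196, 256)      # e.g. ViT patch features of one image
group_tokens = torch.randn(8, 256, requires_grad=True)
prototypes = estimate_prototypes(patch_feats, num_prototypes=8)
loss = npr_loss(group_tokens, prototypes)
loss.backward()
```

Per the abstract, PGSeg applies this kind of regularization with prototypical sources from both images and texts at different levels; the sketch above shows only a single source for brevity.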
Related papers
- Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SecViT, which attains an impressive 84.2% image classification accuracy with only 27M parameters and 4.4G FLOPs.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
- Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z)
- Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer)
CGFormer explicitly captures object-level information via token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Specifically, DenseDINO introduces some extra input tokens called reference tokens to match the point-level features with the position prior.
Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on classification in ImageNet.
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches, such as CLIP, and propose representing images and texts with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- PUPS: Point Cloud Unified Panoptic Segmentation [13.668363631123649]
We propose a simple but effective point cloud unified panoptic segmentation (PUPS) framework.
PUPS uses a set of point-level classifiers to directly predict semantic and instance groupings in an end-to-end manner.
PUPS achieves 1st place on the leaderboard of the SemanticKITTI panoptic segmentation task and state-of-the-art results on nuScenes.
arXiv Detail & Related papers (2023-02-13T08:42:41Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation [63.910211095033596]
Few-shot segmentation aims to segment unseen-class objects given only a handful of densely labeled samples.
We propose a simple yet versatile framework in the spirit of divide-and-conquer.
Our proposed approach, named divide-and-conquer proxies (DCP), allows appropriate and reliable information to be developed for guiding query image segmentation.
arXiv Detail & Related papers (2022-04-21T06:21:14Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.