Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation
- URL: http://arxiv.org/abs/2310.19001v1
- Date: Sun, 29 Oct 2023 13:18:00 GMT
- Title: Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation
- Authors: Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao
Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
- Abstract summary: This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance segmentation network (PGSeg), which incorporates multi-modal regularization.
- Score: 59.37587762543934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the problem of weakly open-vocabulary semantic
segmentation (WOVSS), which learns to segment objects of arbitrary classes
using mere image-text pairs. Existing works enhance the vanilla vision
transformer by introducing explicit grouping recognition, i.e., employing
several group tokens/centroids to cluster the image tokens and perform
group-text alignment. Nevertheless, these methods suffer from a granularity
inconsistency regarding the usage of group tokens, which are aligned in an
all-to-one vs. one-to-one manner during the training and inference phases,
respectively. We argue that this discrepancy arises from the lack of elaborate
supervision for each group token. To bridge this granularity gap, this paper
explores explicit supervision for the group tokens from the prototypical
knowledge. To this end, this paper proposes the non-learnable prototypical
regularization (NPR) where non-learnable prototypes are estimated from source
features to serve as supervision and enable contrastive matching of the group
tokens. This regularization encourages the group tokens to segment objects with
less redundancy and capture more comprehensive semantic regions, leading to
increased compactness and richness. Based on NPR, we propose the prototypical
guidance segmentation network (PGSeg) that incorporates multi-modal
regularization by leveraging prototypical sources from both images and texts at
different levels, progressively enhancing the segmentation capability with
diverse prototypical patterns. Experimental results show that our proposed
method achieves state-of-the-art performance on several benchmark datasets. The
source code is available at https://github.com/Ferenas/PGSeg.
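To make the NPR idea concrete, below is a minimal sketch (not the authors' implementation; see the repository above for that) of how non-learnable prototypes could be estimated from source features with a few k-means-style iterations and then used as contrastive supervision for the group tokens. The tensor shapes, the number of prototypes, and the nearest-prototype assignment are illustrative assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_prototypes(source_feats, num_prototypes=8, iters=10):
    # Estimate non-learnable prototypes from source features (e.g. image patch
    # tokens or text embeddings) with a simple k-means-style routine.
    # source_feats: (N, D), assumed L2-normalized; returns (K, D) without gradients.
    idx = torch.randperm(source_feats.size(0))[:num_prototypes]
    protos = source_feats[idx].clone()
    for _ in range(iters):
        assign = (source_feats @ protos.t()).argmax(dim=1)   # nearest prototype
        for k in range(num_prototypes):
            members = source_feats[assign == k]
            if members.numel() > 0:
                protos[k] = F.normalize(members.mean(dim=0), dim=0)
    return protos

def prototypical_regularization(group_tokens, prototypes, temperature=0.1):
    # Contrastively match each group token to its nearest prototype: the token
    # is pulled toward that prototype and pushed away from the others,
    # encouraging compact, non-redundant groups.
    # group_tokens: (G, D); prototypes: (K, D), detached / non-learnable.
    group_tokens = F.normalize(group_tokens, dim=-1)
    logits = group_tokens @ prototypes.t() / temperature     # (G, K)
    targets = logits.argmax(dim=1).detach()
    return F.cross_entropy(logits, targets)

In PGSeg this kind of regularization is applied with prototypical sources from both images and texts at different levels; the nearest-prototype assignment above is only a placeholder for whatever matching scheme the paper actually adopts.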
Related papers
- Multi-Grained Cross-modal Alignment for Learning Open-vocabulary
Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant improvements over state-of-the-art methods, demonstrating both effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z)
- Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, the Contrastive Grouping with Transformer network (CGFormer).
CGFormer explicitly captures object-level information via token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Specifically, DenseDINO introduces extra input tokens, called reference tokens, to match point-level features with the position prior.
Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on ImageNet classification.
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit learning-based vision-language pre-training approaches such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- PUPS: Point Cloud Unified Panoptic Segmentation [13.668363631123649]
We propose a simple but effective point cloud unified panoptic segmentation (PUPS) framework.
PUPS uses a set of point-level classifiers to directly predict semantic and instance groupings in an end-to-end manner.
PUPS achieves 1st place on the leaderboard of the SemanticKITTI panoptic segmentation task and state-of-the-art results on nuScenes.
arXiv Detail & Related papers (2023-02-13T08:42:41Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation [63.910211095033596]
Few-shot segmentation aims to segment unseen-class objects given only a handful of densely labeled samples.
We propose a simple yet versatile framework in the spirit of divide-and-conquer.
Our proposed approach, named divide-and-conquer proxies (DCP), develops appropriate and reliable support information for segmenting unseen-class objects.
arXiv Detail & Related papers (2022-04-21T06:21:14Z)
- GroupViT: Semantic Segmentation Emerges from Text Supervision [82.02467579704091]
Grouping and recognition are important components of visual scene understanding.
We propose a hierarchical Grouping Vision Transformer (GroupViT).
GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner.
arXiv Detail & Related papers (2022-02-22T18:56:04Z)
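For context on the grouping mechanism that GroupViT introduced and that PGSeg builds upon, the following is a hypothetical sketch of group-token grouping and image-level group-text alignment: learnable group tokens cross-attend to image patch tokens, and the pooled group embeddings are contrasted with text embeddings. The module layout and dimensions are illustrative assumptions, not GroupViT's actual code; the all-to-one mean pooling in the alignment step is the training-time granularity that the PGSeg paper contrasts with the one-to-one use of group tokens at inference.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupingBlock(nn.Module):
    # Illustrative grouping stage: learnable group tokens cross-attend to the
    # image patch tokens, so each group token aggregates one semantic region.
    def __init__(self, dim=256, num_groups=8):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) tokens from a vision transformer backbone.
        B = patch_tokens.size(0)
        queries = self.group_tokens.unsqueeze(0).expand(B, -1, -1)      # (B, G, D)
        groups, attn = self.attn(queries, patch_tokens, patch_tokens)   # (B, G, D), (B, G, N)
        return groups, attn

def group_text_alignment(groups, text_emb, temperature=0.07):
    # CLIP-style image-text contrastive loss on the pooled group embeddings.
    # groups: (B, G, D); text_emb: (B, D). The mean over G is the "all-to-one" step.
    image_emb = F.normalize(groups.mean(dim=1), dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                     # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)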