GroupViT: Semantic Segmentation Emerges from Text Supervision
- URL: http://arxiv.org/abs/2202.11094v1
- Date: Tue, 22 Feb 2022 18:56:04 GMT
- Title: GroupViT: Semantic Segmentation Emerges from Text Supervision
- Authors: Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel,
Jan Kautz, Xiaolong Wang
- Abstract summary: Grouping and recognition are important components of visual scene understanding.
We propose a hierarchical Grouping Vision Transformer (GroupViT)
GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner.
- Score: 82.02467579704091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grouping and recognition are important components of visual scene
understanding, e.g., for object detection and semantic segmentation. With
end-to-end deep learning systems, grouping of image regions usually happens
implicitly via top-down supervision from pixel-level recognition labels.
Instead, in this paper, we propose to bring back the grouping mechanism into
deep networks, which allows semantic segments to emerge automatically with only
text supervision. We propose a hierarchical Grouping Vision Transformer
(GroupViT), which goes beyond the regular grid structure representation and
learns to group image regions into progressively larger arbitrary-shaped
segments. We train GroupViT jointly with a text encoder on a large-scale
image-text dataset via contrastive losses. With only text supervision and
without any pixel-level annotations, GroupViT learns to group together semantic
regions and successfully transfers to the task of semantic segmentation in a
zero-shot manner, i.e., without any further fine-tuning. It achieves a
zero-shot accuracy of 51.2% mIoU on the PASCAL VOC 2012 and 22.3% mIoU on
PASCAL Context datasets, and performs competitively to state-of-the-art
transfer-learning methods requiring greater levels of supervision. Project page
is available at https://jerryxu.net/GroupViT.
Related papers
- FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation [63.31007867379312]
Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions.
A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information.
In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information.
We propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation.
arXiv Detail & Related papers (2025-01-01T15:47:04Z) - A Lightweight Clustering Framework for Unsupervised Semantic
Segmentation [28.907274978550493]
Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data.
We propose a lightweight clustering framework for unsupervised semantic segmentation.
Our framework achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2023-11-30T15:33:42Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS)
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - TextPSG: Panoptic Scene Graph Generation from Textual Descriptions [78.1140391134517]
We study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG)
The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs.
We propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator.
arXiv Detail & Related papers (2023-10-10T22:36:15Z) - Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer)
CGFormer explicitly captures object-level information via token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z) - SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary
Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation.
The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z) - Open-world Semantic Segmentation via Contrasting and Clustering
Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z) - Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z) - Unsupervised Hierarchical Semantic Segmentation with Multiview
Cosegmentation and Clustering Transformers [47.45830503277631]
Grouping naturally has levels of granularity, creating ambiguity in unsupervised segmentation.
We deliver the first data-driven unsupervised hierarchical semantic segmentation method called Hierarchical Segment Grouping (HSG)
arXiv Detail & Related papers (2022-04-25T04:40:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.