GroupViT: Semantic Segmentation Emerges from Text Supervision
- URL: http://arxiv.org/abs/2202.11094v1
- Date: Tue, 22 Feb 2022 18:56:04 GMT
- Title: GroupViT: Semantic Segmentation Emerges from Text Supervision
- Authors: Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel,
Jan Kautz, Xiaolong Wang
- Abstract summary: Grouping and recognition are important components of visual scene understanding.
We propose a hierarchical Grouping Vision Transformer (GroupViT)
GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner.
- Score: 82.02467579704091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grouping and recognition are important components of visual scene
understanding, e.g., for object detection and semantic segmentation. With
end-to-end deep learning systems, grouping of image regions usually happens
implicitly via top-down supervision from pixel-level recognition labels.
Instead, in this paper, we propose to bring back the grouping mechanism into
deep networks, which allows semantic segments to emerge automatically with only
text supervision. We propose a hierarchical Grouping Vision Transformer
(GroupViT), which goes beyond the regular grid structure representation and
learns to group image regions into progressively larger arbitrary-shaped
segments. We train GroupViT jointly with a text encoder on a large-scale
image-text dataset via contrastive losses. With only text supervision and
without any pixel-level annotations, GroupViT learns to group together semantic
regions and successfully transfers to the task of semantic segmentation in a
zero-shot manner, i.e., without any further fine-tuning. It achieves a
zero-shot accuracy of 51.2% mIoU on the PASCAL VOC 2012 and 22.3% mIoU on
PASCAL Context datasets, and performs competitively to state-of-the-art
transfer-learning methods requiring greater levels of supervision. Project page
is available at https://jerryxu.net/GroupViT.
Related papers
- A Lightweight Clustering Framework for Unsupervised Semantic
Segmentation [28.907274978550493]
Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data.
We propose a lightweight clustering framework for unsupervised semantic segmentation.
Our framework achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2023-11-30T15:33:42Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS)
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - TextPSG: Panoptic Scene Graph Generation from Textual Descriptions [78.1140391134517]
We study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG)
The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs.
We propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator.
arXiv Detail & Related papers (2023-10-10T22:36:15Z) - Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer)
CGFormer explicitly captures object-level information via token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose multi-textbfView textbfConsistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary
Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation.
The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z) - Open-world Semantic Segmentation via Contrasting and Clustering
Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z) - Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z) - Unsupervised Hierarchical Semantic Segmentation with Multiview
Cosegmentation and Clustering Transformers [47.45830503277631]
Grouping naturally has levels of granularity, creating ambiguity in unsupervised segmentation.
We deliver the first data-driven unsupervised hierarchical semantic segmentation method called Hierarchical Segment Grouping (HSG)
arXiv Detail & Related papers (2022-04-25T04:40:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.