Contrastive Grouping with Transformer for Referring Image Segmentation
- URL: http://arxiv.org/abs/2309.01017v1
- Date: Sat, 2 Sep 2023 20:53:42 GMT
- Title: Contrastive Grouping with Transformer for Referring Image Segmentation
- Authors: Jiajin Tang, Ge Zheng, Cheng Shi, Sibei Yang
- Abstract summary: We propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer)
CGFormer explicitly captures object-level information via token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
- Score: 23.276636282894582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims to segment the target referent in an image
conditioning on a natural language expression. Existing one-stage methods
employ per-pixel classification frameworks, which attempt straightforwardly to
align vision and language at the pixel level, thus failing to capture critical
object-level information. In this paper, we propose a mask classification
framework, Contrastive Grouping with Transformer network (CGFormer), which
explicitly captures object-level information via token-based querying and
grouping strategy. Specifically, CGFormer first introduces learnable query
tokens to represent objects and then alternately queries linguistic features
and groups visual features into the query tokens for object-aware cross-modal
reasoning. In addition, CGFormer achieves cross-level interaction by jointly
updating the query tokens and decoding masks in every two consecutive layers.
Finally, CGFormer cooperates contrastive learning to the grouping strategy to
identify the token and its mask corresponding to the referent. Experimental
results demonstrate that CGFormer outperforms state-of-the-art methods in both
segmentation and generalization settings consistently and significantly.
Related papers
- Instance-Aware Generalized Referring Expression Segmentation [32.96760407482406]
InstAlign is a method that incorporates object-level reasoning into the segmentation process.
Our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
arXiv Detail & Related papers (2024-11-22T17:28:43Z) - Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE)
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS)
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - Learning Open-vocabulary Semantic Segmentation Models From Natural
Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed as OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method by using only 3% data (4M vs 134M) for pre-training.
arXiv Detail & Related papers (2023-01-22T13:10:05Z) - Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on transformer to perform Linguistic query-Guided mask generation.
arXiv Detail & Related papers (2023-01-16T13:38:22Z) - Discovering Object Masks with Transformers for Unsupervised Semantic
Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z) - GroupViT: Semantic Segmentation Emerges from Text Supervision [82.02467579704091]
Grouping and recognition are important components of visual scene understanding.
We propose a hierarchical Grouping Vision Transformer (GroupViT)
GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner.
arXiv Detail & Related papers (2022-02-22T18:56:04Z) - GANSeg: Learning to Segment by Unsupervised Hierarchical Image
Generation [16.900404701997502]
We propose a GAN-based approach that generates images conditioned on latent masks.
We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner.
It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
arXiv Detail & Related papers (2021-12-02T07:57:56Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.