Linguistic Query-Guided Mask Generation for Referring Image Segmentation
- URL: http://arxiv.org/abs/2301.06429v3
- Date: Wed, 22 Mar 2023 12:01:42 GMT
- Title: Linguistic Query-Guided Mask Generation for Referring Image Segmentation
- Authors: Zhichao Wei, Xiaohao Chen, Mingqiang Chen, Siyu Zhu
- Abstract summary: Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on the transformer architecture to perform Linguistic query-Guided mask generation.
- Score: 10.130530501400079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims to segment the image region of interest
according to the given language expression, which is a typical multi-modal
task. Existing methods either adopt the pixel classification-based or the
learnable query-based framework for mask generation, both of which are
insufficient to deal with various text-image pairs with a fixed number of
parametric prototypes. In this work, we propose an end-to-end framework built
on the transformer architecture to perform Linguistic query-Guided mask generation, dubbed
LGFormer. It views the linguistic features as queries to generate a specialized
prototype for an arbitrary input image-text pair, thus producing more consistent
segmentation results. Moreover, we design several cross-modal interaction
modules (e.g., the vision-language bidirectional attention module, VLBA) in both
encoder and decoder to achieve better cross-modal alignment.
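To make the query-guided design more concrete, below is a minimal PyTorch sketch of the idea, assuming 256-d features, a pooled sentence query, and a simple bidirectional cross-attention block; the module names and layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class BidirectionalAttention(nn.Module):
    """Rough VLBA-style block: vision attends to language and language attends
    to vision. The exact design of the paper's VLBA module may differ."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, N_pixels, C), lang: (B, N_words, C)
        vis_ctx, _ = self.v2l(query=vis, key=lang, value=lang)    # vision gathers linguistic context
        lang_ctx, _ = self.l2v(query=lang, key=vis, value=vis)    # language gathers visual context
        return vis + vis_ctx, lang + lang_ctx


class LinguisticQueryMaskHead(nn.Module):
    """Sketch of linguistic query-guided mask generation: the language features
    act as the query that yields a per-sample prototype, which is then
    correlated with every pixel embedding to produce mask logits."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.interact = BidirectionalAttention(d_model, n_heads)
        self.decoder = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proto_proj = nn.Linear(d_model, d_model)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W) visual feature map; lang_feat: (B, L, C) word features
        B, C, H, W = vis_feat.shape
        vis_tokens = vis_feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
        vis_tokens, lang_tokens = self.interact(vis_tokens, lang_feat)
        # Pool the sentence into a single linguistic query and let it attend to
        # the visual tokens, giving one specialized prototype per image-text pair.
        sent_query = lang_tokens.mean(dim=1, keepdim=True)        # (B, 1, C)
        proto, _ = self.decoder(query=sent_query, key=vis_tokens, value=vis_tokens)
        proto = self.proto_proj(proto).squeeze(1)                 # (B, C)
        vis_map = vis_tokens.transpose(1, 2).reshape(B, C, H, W)
        return torch.einsum("bc,bchw->bhw", proto, vis_map)       # (B, H, W) mask logits


if __name__ == "__main__":
    head = LinguisticQueryMaskHead(d_model=256)
    vis = torch.randn(2, 256, 30, 30)   # e.g. backbone features of two images
    lang = torch.randn(2, 12, 256)      # 12 word embeddings projected to 256-d
    print(head(vis, lang).shape)        # torch.Size([2, 30, 30])
```

The point of contrast with a fixed bank of learnable prototypes is that the prototype here is computed from the language of each individual input pair.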
Related papers
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, the Contrastive Grouping with Transformer network (CGFormer).
CGFormer explicitly captures object-level information via a token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - EAVL: Explicitly Align Vision and Language for Referring Image Segmentation [27.351940191216343]
We introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence.
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation.
arXiv Detail & Related papers (2023-08-18T18:59:27Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - MMNet: Multi-Mask Network for Referring Image Segmentation [6.462622145673872]
We propose an end-to-end Multi-Mask Network for referring image segmentation (MMNet).
We first combine the image and language features and then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression.
The final result is obtained through the weighted sum of all masks, which greatly reduces the randomness of the language expression.
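As an editor's illustration of the multi-query idea (not MMNet's actual code), the sketch below derives several queries from the expression with attention, predicts one mask and one confidence weight per query, and returns their weighted sum; the shapes, `num_queries`, and module names are assumptions.

```python
import torch
import torch.nn as nn


class MultiMaskHead(nn.Module):
    """Sketch: derive several queries from the expression, predict one mask per
    query plus a confidence weight, and return the weighted sum of all masks."""

    def __init__(self, d_model: int = 256, num_queries: int = 4, n_heads: int = 8):
        super().__init__()
        self.query_seeds = nn.Embedding(num_queries, d_model)     # learnable query seeds
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.weight_head = nn.Linear(d_model, 1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W); lang_feat: (B, L, C) word features
        B, C, H, W = vis_feat.shape
        seeds = self.query_seeds.weight.unsqueeze(0).expand(B, -1, -1)       # (B, Q, C)
        # Each seed attends to the words, so the queries can capture different
        # aspects of the expression (subject, attributes, relations, ...).
        queries, _ = self.attn(query=seeds, key=lang_feat, value=lang_feat)  # (B, Q, C)
        masks = torch.einsum("bqc,bchw->bqhw", queries, vis_feat)            # one mask per query
        weights = self.weight_head(queries).softmax(dim=1)                   # (B, Q, 1)
        return (masks * weights.unsqueeze(-1)).sum(dim=1)                    # (B, H, W)


if __name__ == "__main__":
    head = MultiMaskHead()
    out = head(torch.randn(2, 256, 30, 30), torch.randn(2, 10, 256))
    print(out.shape)  # torch.Size([2, 30, 30])
```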
arXiv Detail & Related papers (2023-05-24T10:02:27Z) - Generalized Decoding for Pixel, Image, and Language [197.85760901840177]
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.
X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
arXiv Detail & Related papers (2022-12-21T18:58:41Z) - LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
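To picture what text-to-pixel alignment can look like, the following is a hedged sketch of a per-pixel contrastive-style loss between the sentence embedding and pixel embeddings; the shapes, temperature, and function name are assumptions and this is not CRIS's exact loss or decoder.

```python
import torch
import torch.nn.functional as F


def text_to_pixel_loss(pixel_feat, text_feat, gt_mask, temperature: float = 0.07):
    """Pull pixels inside the referred region toward the sentence embedding and
    push background pixels away.
    pixel_feat: (B, C, H, W), text_feat: (B, C), gt_mask: (B, H, W) binary."""
    pixel = F.normalize(pixel_feat.flatten(2), dim=1)      # (B, C, HW) unit pixel embeddings
    text = F.normalize(text_feat, dim=1).unsqueeze(-1)     # (B, C, 1) unit sentence embedding
    logits = (pixel * text).sum(dim=1) / temperature       # (B, HW) scaled cosine similarity
    target = gt_mask.flatten(1).float()                    # 1 = foreground pixel, 0 = background
    return F.binary_cross_entropy_with_logits(logits, target)


if __name__ == "__main__":
    loss = text_to_pixel_loss(torch.randn(2, 256, 30, 30),
                              torch.randn(2, 256),
                              torch.randint(0, 2, (2, 30, 30)))
    print(loss.item())
```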
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Vision-Language Transformer and Query Generation for Referring Segmentation [39.01244764840372]
We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
arXiv Detail & Related papers (2021-08-12T07:24:35Z)
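To illustrate "querying the image with the language expression", the sketch below feeds word features as decoder queries over flattened image tokens using a standard PyTorch transformer decoder; the class name and hyperparameters are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class LanguageQueriesImage(nn.Module):
    """Sketch of encoder-decoder attention for referring segmentation: queries
    built from the expression attend to flattened image tokens, and the pooled
    response is correlated with every pixel to form mask logits."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W) image features; lang_feat: (B, L, C) word features
        B, C, H, W = vis_feat.shape
        memory = vis_feat.flatten(2).transpose(1, 2)             # (B, H*W, C) image tokens
        queries = self.decoder(tgt=lang_feat, memory=memory)     # language "queries" the image
        sentence = queries.mean(dim=1)                           # (B, C) pooled query response
        return torch.einsum("bc,bchw->bhw", sentence, vis_feat)  # (B, H, W) mask logits


if __name__ == "__main__":
    model = LanguageQueriesImage()
    print(model(torch.randn(2, 256, 30, 30), torch.randn(2, 8, 256)).shape)
```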