Vision-Language Transformer and Query Generation for Referring
Segmentation
- URL: http://arxiv.org/abs/2108.05565v1
- Date: Thu, 12 Aug 2021 07:24:35 GMT
- Title: Vision-Language Transformer and Query Generation for Referring
Segmentation
- Authors: Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang
- Abstract summary: We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
- Score: 39.01244764840372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we address the challenging task of referring segmentation. The
query expression in referring segmentation typically indicates the target
object by describing its relationship with others. Therefore, to find the
target one among all instances in the image, the model must have a holistic
understanding of the whole image. To achieve this, we reformulate referring
segmentation as a direct attention problem: finding the region in the image
where the query language expression is most attended to. We introduce
transformer and multi-head attention to build a network with an encoder-decoder
attention mechanism architecture that "queries" the given image with the
language expression. Furthermore, we propose a Query Generation Module, which
produces multiple sets of queries with different attention weights that
represent the diversified comprehensions of the language expression from
different aspects. At the same time, to find the best way from these
diversified comprehensions based on visual clues, we further propose a Query
Balance Module to adaptively select the output features of these queries for a
better mask generation. Without bells and whistles, our approach is
light-weight and achieves new state-of-the-art performance consistently on
three referring segmentation datasets, RefCOCO, RefCOCO+, and G-Ref. Our code
is available at https://github.com/henghuiding/Vision-Language-Transformer.
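For intuition, the sketch below shows how the pieces described in the abstract could fit together: language-conditioned query generation, encoder-decoder attention that "queries" the image, and query balancing before mask prediction. It is a minimal PyTorch sketch under assumed names, shapes, and hyperparameters (QueryGenerationModule, QueryBalanceModule, dim=256, num_queries=16 are illustrative, not the authors' code); the implementation of record is the repository linked above.
```python
# Minimal sketch of the VLT idea described above (assumed names/shapes, not the official code).
import torch
import torch.nn as nn

class QueryGenerationModule(nn.Module):
    """Produce multiple queries, each a differently weighted view of the language expression."""
    def __init__(self, dim, num_queries):
        super().__init__()
        self.word_attn = nn.Linear(dim, num_queries)  # one set of word-attention weights per query

    def forward(self, lang_feats):                    # lang_feats: (B, L, C) word features
        weights = self.word_attn(lang_feats).softmax(dim=1)        # (B, L, Nq)
        return torch.einsum('blc,bln->bnc', lang_feats, weights)   # (B, Nq, C) diversified queries

class QueryBalanceModule(nn.Module):
    """Adaptively weight the decoded queries to favor the comprehensions that fit the visual clues."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, query_feats):                   # query_feats: (B, Nq, C)
        w = self.score(query_feats).softmax(dim=1)    # (B, Nq, 1) per-query weights
        return (w * query_feats).sum(dim=1)           # (B, C) fused feature for mask generation

class VLTSketch(nn.Module):
    def __init__(self, dim=256, num_queries=16):
        super().__init__()
        self.qgm = QueryGenerationModule(dim, num_queries)
        # A single encoder-decoder attention layer stands in for the full transformer decoder.
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.qbm = QueryBalanceModule(dim)

    def forward(self, vis_feats, lang_feats):
        # vis_feats: (B, HW, C) flattened image features; lang_feats: (B, L, C) word features
        queries = self.qgm(lang_feats)                # language queries with different emphases
        decoded = self.decoder(queries, vis_feats)    # queries attend to ("query") the image
        fused = self.qbm(decoded)                     # balance the diversified comprehensions
        return torch.einsum('bc,bpc->bp', fused, vis_feats)  # (B, HW) per-location mask logits
```
In the real model the single decoder layer would be a multi-layer transformer decoder and the final dot product a richer mask decoder; the sketch only locates where query generation and query balance sit in the pipeline.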
Related papers
- OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer.
To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM).
Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
- Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer).
CGFormer explicitly captures object-level information via token-based querying and grouping strategy.
Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z)
- EAVL: Explicitly Align Vision and Language for Referring Image Segmentation [27.351940191216343]
We introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence (a minimal sketch of this mechanism appears after the list below).
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation.
arXiv Detail & Related papers (2023-08-18T18:59:27Z)
- MMNet: Multi-Mask Network for Referring Image Segmentation [6.462622145673872]
We propose an end-to-end Multi-Mask Network for referring image segmentation (MMNet).
We first combine the image and the language expression, then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression.
The final result is obtained through the weighted sum of all masks, which greatly reduces the randomness of the language expression.
arXiv Detail & Related papers (2023-05-24T10:02:27Z)
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image region given a natural language expression.
We develop an algorithm that shifts from a localization-centric to a segmentation-centric design.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on transformer to perform Linguistic query-Guided mask generation.
arXiv Detail & Related papers (2023-01-16T13:38:22Z)
- VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [31.051579752237746]
We propose a framework for referring segmentation to facilitate deep interactions among multi-modal information.
We introduce masked contrastive learning to narrow down the features of different expressions for the same target object.
The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
arXiv Detail & Related papers (2022-10-28T03:36:07Z)
- ReSTR: Convolution-free Referring Image Segmentation Using Transformers [80.9672131755143]
We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
arXiv Detail & Related papers (2022-03-31T02:55:39Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
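The EAVL entry above aligns language and vision in the segmentation stage through dynamic convolution kernels derived from the input image and sentence. The sketch below illustrates the general language-conditioned dynamic convolution mechanism; the class and parameter names (DynamicConvMaskHead, vis_dim, lang_dim) are assumptions for exposition and are not taken from any of the papers listed here.
```python
# Illustrative sketch of language-conditioned dynamic convolution for mask prediction
# (assumed names/shapes, not code from the papers above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvMaskHead(nn.Module):
    """Predict a 1x1 convolution kernel from a sentence embedding and apply it to visual features."""
    def __init__(self, vis_dim=256, lang_dim=256):
        super().__init__()
        # One kernel (vis_dim weights) plus one bias per image-sentence pair.
        self.kernel_gen = nn.Linear(lang_dim, vis_dim + 1)

    def forward(self, vis_feats, sent_feat):
        # vis_feats: (B, C, H, W) fused visual features, C == vis_dim
        # sent_feat: (B, lang_dim) sentence-level language embedding
        B, C, H, W = vis_feats.shape
        params = self.kernel_gen(sent_feat)                        # (B, C + 1)
        weight = params[:, :-1].contiguous().reshape(B, C, 1, 1)   # per-sample 1x1 kernels
        bias = params[:, -1].contiguous()                          # per-sample biases
        # Grouped convolution applies each sample's own kernel to its own feature map.
        out = F.conv2d(vis_feats.reshape(1, B * C, H, W), weight, bias=bias, groups=B)
        return out.reshape(B, 1, H, W)                             # per-pixel mask logits

# Example usage with random tensors:
if __name__ == "__main__":
    head = DynamicConvMaskHead()
    masks = head(torch.randn(2, 256, 30, 30), torch.randn(2, 256))
    print(masks.shape)  # torch.Size([2, 1, 30, 30])
```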