MMNet: Multi-Mask Network for Referring Image Segmentation
- URL: http://arxiv.org/abs/2305.14969v1
- Date: Wed, 24 May 2023 10:02:27 GMT
- Title: MMNet: Multi-Mask Network for Referring Image Segmentation
- Authors: Yichen Yan, Xingjian He, Wenxuan Wan, Jing Liu
- Abstract summary: We propose an end-to-end Multi-Mask Network for referring image segmentation (MMNet).
We first combine image and language features, then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression.
The final result is obtained through the weighted sum of all masks, which greatly reduces the randomness of the language expression.
- Score: 6.462622145673872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims to segment an object referred to by a natural
language expression from an image. However, this task is challenging due to the
distinct data properties between text and image, and the randomness introduced
by diverse objects and unrestricted language expression. Most previous work
focuses on improving cross-modal feature fusion while not fully addressing the
inherent uncertainty caused by diverse objects and unrestricted language. To
tackle these problems, we propose an end-to-end Multi-Mask Network for
referring image segmentation (MMNet). We first combine image and language and
then employ an attention mechanism to generate multiple queries that represent
different aspects of the language expression. We then utilize these queries to
produce a series of corresponding segmentation masks, assigning a score to each
mask that reflects its importance. The final result is obtained through the
weighted sum of all masks, which greatly reduces the randomness of the language
expression. Our proposed framework demonstrates superior performance compared
to state-of-the-art approaches on three commonly used datasets, RefCOCO,
RefCOCO+ and G-Ref, without the need for any post-processing. This further
validates the efficacy of our proposed framework.
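The pipeline described in the abstract (attention-generated queries, one mask per query, a score per mask, and a weighted sum) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the dot-product attention, the softmax normalization of the scores, and the threshold at zero are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def generate_queries(word_feats, query_seeds):
    # Hypothetical query generator: each of the K seeds attends over the
    # T word features, so each query can emphasize different words.
    # word_feats: (T, D), query_seeds: (K, D) -> queries: (K, D)
    d = word_feats.shape[-1]
    attn = softmax(query_seeds @ word_feats.T / np.sqrt(d), axis=-1)
    return attn @ word_feats

def combine_masks(mask_logits, scores):
    # Assumption: scores are normalized with a softmax before weighting.
    # mask_logits: (K, H, W), scores: (K,) -> fused logits: (H, W)
    weights = softmax(scores, axis=0)
    fused = np.tensordot(weights, mask_logits, axes=1)
    return fused, weights

def predict(queries, pixel_feats, score_proj):
    # One mask per query via a dot product with per-pixel features
    # (pixel_feats: (H, W, D)); one scalar score per query via a
    # hypothetical linear projection (score_proj: (D,)).
    mask_logits = np.einsum("kd,hwd->khw", queries, pixel_feats)
    scores = queries @ score_proj
    fused, weights = combine_masks(mask_logits, scores)
    return fused > 0, weights  # binary mask, per-mask weights
```

With K queries, `predict` returns a single binary mask together with the weight given to each of the K candidate masks, so a query that captures an unhelpful reading of the expression can be down-weighted rather than deciding the result on its own.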
Related papers
- Mask Grounding for Referring Image Segmentation [42.69973300692365]
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions.
Most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level.
We introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features.
arXiv Detail & Related papers (2023-12-19T14:34:36Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Completing Visual Objects via Bridging Generation and Segmentation [84.4552458720467]
MaskComp delineates the completion process through iterative stages of generation and segmentation.
In each iteration, the object mask is provided as an additional condition to boost image generation.
We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser.
arXiv Detail & Related papers (2023-10-01T22:25:40Z)
- Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities.
arXiv Detail & Related papers (2023-05-24T16:26:05Z)
- Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on a transformer to perform linguistic query-guided mask generation.
arXiv Detail & Related papers (2023-01-16T13:38:22Z)
- Mask Matching Transformer for Few-Shot Segmentation [71.32725963630837]
Mask Matching Transformer (MM-Former) is a new paradigm for the few-shot segmentation task.
First, the MM-Former follows the paradigm of decompose first and then blend, allowing our method to benefit from the advanced potential objects segmenter.
We conduct extensive experiments on the popular COCO-$20^i$ and Pascal-$5^i$ benchmarks.
arXiv Detail & Related papers (2022-12-05T11:00:32Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- Vision-Language Transformer and Query Generation for Referring Segmentation [39.01244764840372]
We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
arXiv Detail & Related papers (2021-08-12T07:24:35Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred to by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- CRNet: Cross-Reference Networks for Few-Shot Segmentation [59.85183776573642]
Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images.
With a cross-reference mechanism, our network can better find the co-occurrent objects in the two images.
Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-03-24T04:55:43Z)