ReSTR: Convolution-free Referring Image Segmentation Using Transformers
- URL: http://arxiv.org/abs/2203.16768v1
- Date: Thu, 31 Mar 2022 02:55:39 GMT
- Title: ReSTR: Convolution-free Referring Image Segmentation Using Transformers
- Authors: Namyup Kim, Dongwon Kim, Cuiling Lan, Wenjun Zeng, Suha Kwak
- Abstract summary: We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
- Score: 80.9672131755143
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Referring image segmentation is an advanced semantic segmentation
task in which the target is not a predefined class but is instead described in
natural language. Most existing methods for this task rely heavily on
convolutional neural networks, which, however, have trouble capturing
long-range dependencies between entities in the language expression and are
not flexible enough to model interactions between the two different
modalities. To address these issues, we present the first convolution-free
model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders,
it can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder,
which enables flexible and adaptive interactions between the two modalities
in the fusion process. The fused features are fed to a segmentation module,
which works adaptively according to the image and language expression at
hand. ReSTR is evaluated and compared with previous work on all public
benchmarks, where it outperforms all existing models.
Related papers
- EAVL: Explicitly Align Vision and Language for Referring Image Segmentation [27.351940191216343]
We introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence.
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation.
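A minimal PyTorch sketch of the dynamic-convolution idea this summary
describes: a sentence embedding is turned into 1x1 convolution weights that
are applied to the visual feature map, so the text directly parameterizes the
pixel-level prediction. Module names and sizes are hypothetical, not EAVL's
actual design (which also conditions the kernels on the image).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicKernelHead(nn.Module):
    """Language-conditioned dynamic convolution (illustrative sketch)."""

    def __init__(self, vis_dim=256, lang_dim=512):
        super().__init__()
        # Predict one 1x1 kernel (vis_dim weights + 1 bias) per sentence.
        self.kernel_gen = nn.Linear(lang_dim, vis_dim + 1)

    def forward(self, vis_feat, sent_emb):
        # vis_feat: (B, C, H, W), sent_emb: (B, lang_dim)
        b, c, h, w = vis_feat.shape
        params = self.kernel_gen(sent_emb)          # (B, C + 1)
        weight = params[:, :c].reshape(b, c, 1, 1)  # one kernel per sample
        bias = params[:, c]
        # Grouped conv trick: apply each sample's kernel to its own features.
        out = F.conv2d(vis_feat.reshape(1, b * c, h, w), weight,
                       bias=bias, groups=b)
        return out.reshape(b, 1, h, w)              # text-to-pixel logits


logits = DynamicKernelHead()(torch.randn(2, 256, 40, 40), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 1, 40, 40])
```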
arXiv Detail & Related papers (2023-08-18T18:59:27Z)
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image region described by a natural language expression.
We develop an algorithm that shifts from a localization-centric design to one centered on segmentation guided by language.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on transformer to perform Linguistic query-Guided mask generation.
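One plausible reading of "linguistic query-guided mask generation" is a
query-based decoder in the style of mask transformers: the language embedding
serves as a query that attends to pixel features, and its similarity with
per-pixel embeddings yields the mask. The PyTorch sketch below follows that
reading under assumed dimensions; it is not the paper's exact model.

```python
import torch
import torch.nn as nn


class QueryMaskHead(nn.Module):
    """A linguistic query decodes a mask from pixel features (sketch)."""

    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, pixel_feat, lang_query):
        # pixel_feat: (B, C, H, W), lang_query: (B, 1, C)
        b, c, h, w = pixel_feat.shape
        mem = pixel_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        q = self.decoder(lang_query, mem)              # query attends to pixels
        mask = torch.einsum("bqc,bnc->bqn", q, mem)    # query-pixel similarity
        return mask.view(b, 1, h, w)


m = QueryMaskHead()(torch.randn(2, 256, 40, 40), torch.randn(2, 1, 256))
print(m.shape)  # torch.Size([2, 1, 40, 40])
```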
arXiv Detail & Related papers (2023-01-16T13:38:22Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on three fine-grained object recognition benchmarks.
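As a rough illustration of part features plus correlation modeling, the
PyTorch sketch below pools soft part regions from a feature map and relates
them with self-attention; the part-extraction mechanism and all sizes are
assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn


class PartRelationSketch(nn.Module):
    """Pool soft part features, then model part-to-part correlations."""

    def __init__(self, dim=256, parts=4):
        super().__init__()
        self.part_attn = nn.Conv2d(dim, parts, 1)  # soft part-assignment maps
        self.relation = nn.MultiheadAttention(dim, num_heads=8,
                                              batch_first=True)

    def forward(self, feat):                        # feat: (B, C, H, W)
        a = self.part_attn(feat).flatten(2).softmax(dim=-1)     # (B, P, HW)
        parts = torch.bmm(a, feat.flatten(2).transpose(1, 2))   # (B, P, C)
        out, _ = self.relation(parts, parts, parts)  # relations among parts
        return out


parts = PartRelationSketch()(torch.randn(2, 256, 14, 14))
print(parts.shape)  # torch.Size([2, 4, 256])
```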
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding [41.928263518867816]
Conformer has proven to be effective in many speech processing tasks.
Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer.
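The parallel-branch idea can be sketched as follows (PyTorch): one branch
applies self-attention for global context, the other an MLP for
pointwise/local processing, and their outputs are merged. The real
Branchformer uses a convolution-gated MLP branch and a more careful merge;
the plain MLP and concat-projection here are simplifying assumptions.

```python
import torch
import torch.nn as nn


class BranchformerLayerSketch(nn.Module):
    """Parallel attention and MLP branches merged by concat + projection."""

    def __init__(self, dim=256):
        super().__init__()
        self.norm_a = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x):                     # x: (B, T, C)
        n = self.norm_a(x)
        a, _ = self.attn(n, n, n)             # global-context branch
        m = self.mlp(self.norm_m(x))          # local/pointwise branch
        return x + self.merge(torch.cat([a, m], dim=-1))


y = BranchformerLayerSketch()(torch.randn(2, 50, 256))
print(y.shape)  # torch.Size([2, 50, 256])
```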
arXiv Detail & Related papers (2022-07-06T21:08:10Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
- CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel.
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
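The summary's "multiple atrous convolutional layers in parallel" is an
ASPP-style arrangement, which the PyTorch sketch below applies to an
already-fused visual-linguistic feature map; the dilation rates and
dimensions are assumptions.

```python
import torch
import torch.nn as nn


class CMFSketch(nn.Module):
    """Parallel atrous (dilated) convolutions over fused multimodal features."""

    def __init__(self, dim=256, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(len(rates) * dim, dim, 1)

    def forward(self, fused):                 # fused: (B, C, H, W)
        outs = [branch(fused) for branch in self.branches]  # multi-rate context
        return self.project(torch.cat(outs, dim=1))


out = CMFSketch()(torch.randn(2, 256, 40, 40))
print(out.shape)  # torch.Size([2, 256, 40, 40])
```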
arXiv Detail & Related papers (2021-06-16T08:18:39Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
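Segmenter's mask decoding can be sketched as learnable class embeddings
processed jointly with the ViT patch tokens, with class-patch dot products
giving per-class masks. The PyTorch sketch below assumes a 2-layer decoder
and dim=256, and takes the ViT encoder's output as given.

```python
import torch
import torch.nn as nn


class SegmenterMaskDecoderSketch(nn.Module):
    """Class embeddings x patch tokens -> per-class patch masks (sketch)."""

    def __init__(self, dim=256, n_cls=21, depth=2):
        super().__init__()
        self.cls_emb = nn.Parameter(torch.randn(n_cls, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):          # (B, N, C) from a ViT encoder
        b, n, _ = patch_tokens.shape
        cls = self.cls_emb.expand(b, -1, -1)  # one embedding per class
        x = self.decoder(torch.cat([patch_tokens, cls], dim=1))
        patches, cls = x[:, :n], x[:, n:]
        # Dot product between class and patch embeddings -> mask logits.
        return torch.einsum("bkc,bnc->bkn", cls, patches)  # (B, n_cls, N)


masks = SegmenterMaskDecoderSketch()(torch.randn(2, 400, 256))
print(masks.shape)  # torch.Size([2, 21, 400])
```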
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
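A loose sketch of the text-guided feature exchange idea (PyTorch): a sentence
embedding gates each level's visual features, and the gated features are
shared across levels. This is one plausible reading of the summary; every
detail, including same-resolution level maps, is an assumption.

```python
import torch
import torch.nn as nn


class TextGuidedExchangeSketch(nn.Module):
    """Gate multi-level features with text, then exchange across levels."""

    def __init__(self, dim=256, lang_dim=512, levels=3):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Linear(lang_dim, dim) for _ in range(levels))

    def forward(self, feats, sent):
        # feats: list of (B, C, H, W) at the same resolution; sent: (B, lang_dim)
        gated = [f * g(sent).sigmoid()[:, :, None, None]
                 for f, g in zip(feats, self.gates)]
        shared = torch.stack(gated).mean(dim=0)   # cross-level exchange
        return [f + shared for f in gated]


feats = [torch.randn(2, 256, 40, 40) for _ in range(3)]
outs = TextGuidedExchangeSketch()(feats, torch.randn(2, 512))
print(len(outs), outs[0].shape)  # 3 torch.Size([2, 256, 40, 40])
```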
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.