Locate then Segment: A Strong Pipeline for Referring Image Segmentation
- URL: http://arxiv.org/abs/2103.16284v1
- Date: Tue, 30 Mar 2021 12:25:27 GMT
- Title: Locate then Segment: A Strong Pipeline for Referring Image Segmentation
- Authors: Ya Jing, Tao Kong, Wei Wang, Liang Wang, Lei Li, Tieniu Tan
- Abstract summary: Referring image segmentation aims to segment the objects referred to by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
- Score: 73.19139431806853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring image segmentation aims to segment the objects referred to by a
natural language expression. Previous methods usually focus on designing an
implicit and recurrent feature interaction mechanism to fuse the
visual-linguistic features to directly generate the final segmentation mask
without explicitly modeling the localization information of the referent
instances. To tackle these problems, we view this task from another perspective
by decoupling it into a "Locate-Then-Segment" (LTS) scheme. Given a language
expression, people generally first attend to the corresponding target image
regions, then generate a fine segmentation mask for the object
based on its context. The LTS first extracts and fuses both visual and textual
features to get a cross-modal representation, then applies a cross-modal
interaction on the visual-textual features to locate the referred object with
a position prior, and finally generates the segmentation result with a
lightweight segmentation network. Our LTS is simple but surprisingly
effective. On three popular benchmark datasets, the LTS outperforms all the
previous state-of-the-art methods by a large margin (e.g., +3.2% on RefCOCO+
and +3.4% on RefCOCOg). In addition, our model is more interpretable since it
explicitly locates the object, which is also verified by visualization
experiments. We believe this framework is promising to serve as a strong
baseline for referring image segmentation.
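The following is a minimal PyTorch sketch of the three-stage decomposition described in the abstract, assuming simple convolutional fusion, a broadcast sentence embedding, and a single-channel position prior; the layer choices, dimensions, and module names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocateThenSegment(nn.Module):
    """Minimal sketch of the Locate-Then-Segment (LTS) decomposition.

    The three stages mirror the abstract: (1) fuse visual and textual
    features into a cross-modal representation, (2) locate the referent
    to obtain a position prior, and (3) decode a mask with a lightweight
    segmentation head. All layers and sizes here are illustrative.
    """

    def __init__(self, vis_dim=256, txt_dim=256, hidden=256):
        super().__init__()
        # Stage 1: project and fuse the two modalities.
        self.fuse = nn.Conv2d(vis_dim + txt_dim, hidden, kernel_size=1)
        # Stage 2: predict a single-channel position prior (a coarse
        # heatmap over the referred object).
        self.locate = nn.Conv2d(hidden, 1, kernel_size=1)
        # Stage 3: a lightweight segmentation head conditioned on the
        # fused features and the position prior.
        self.segment = nn.Sequential(
            nn.Conv2d(hidden + 1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C_v, H, W) image features; txt_feat: (B, C_t)
        # sentence embedding, broadcast over all spatial locations.
        B, _, H, W = vis_feat.shape
        txt_map = txt_feat[:, :, None, None].expand(-1, -1, H, W)
        fused = F.relu(self.fuse(torch.cat([vis_feat, txt_map], dim=1)))
        prior = torch.sigmoid(self.locate(fused))        # position prior
        mask_logits = self.segment(torch.cat([fused, prior], dim=1))
        return prior, mask_logits

# Usage with dummy features standing in for backbone outputs:
model = LocateThenSegment()
prior, mask = model(torch.randn(2, 256, 40, 40), torch.randn(2, 256))
print(prior.shape, mask.shape)  # (2, 1, 40, 40) each
```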
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a query sentence from a 3D point cloud.
We propose a novel referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU while using only binary labels (a supervision sketch follows this entry).
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D aims to explore comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z) - HARIS: Human-Like Attention for Reference Image Segmentation [5.808325471170541]
We propose a referring image segmentation method called HARIS, which introduces a Human-Like Attention mechanism.
Our method achieves state-of-the-art performance and strong zero-shot ability.
arXiv Detail & Related papers (2024-05-17T11:29:23Z) - Collaborative Position Reasoning Network for Referring Image
Segmentation [30.414910144177757]
We propose a novel method to explicitly model entity localization, especially for non-salient entities.
To our knowledge, this is the first work that explicitly focuses on position reasoning modeling.
arXiv Detail & Related papers (2024-01-22T09:11:12Z) - EAVL: Explicitly Align Vision and Language for Referring Image Segmentation [27.351940191216343]
We introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence.
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation (a kernel-generation sketch follows this entry).
arXiv Detail & Related papers (2023-08-18T18:59:27Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) a Position Aware Module (PAM), which provides position information for all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment (see the contrastive-loss sketch after this entry).
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Fully and Weakly Supervised Referring Expression Segmentation with
- Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression.
We propose a parallel position-kernel-segmentation pipeline to better isolate the localization and segmentation steps and then let them interact (see the sketch after this entry).
Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z) - Discovering Object Masks with Transformers for Unsupervised Semantic
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve an object's inner consistency by modeling the global context, or to refine object details along their boundaries via multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that the appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the low- and high-frequency parts of the image, respectively.
We show that the proposed framework, with various baselines or backbone networks, leads to better object inner consistency and object boundaries (see the decoupled-loss sketch after this entry).
arXiv Detail & Related papers (2020-07-20T12:11:22Z)