Fully and Weakly Supervised Referring Expression Segmentation with
End-to-End Learning
- URL: http://arxiv.org/abs/2212.10278v1
- Date: Sat, 17 Dec 2022 08:29:33 GMT
- Title: Fully and Weakly Supervised Referring Expression Segmentation with
End-to-End Learning
- Authors: Hui Li, Mingjie Sun, Jimin Xiao, Eng Gee Lim, and Yao Zhao
- Abstract summary: Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression.
We propose a parallel position-kernel-segmentation pipeline that first isolates the localization and segmentation steps and then lets them interact.
Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods in both fully- and weakly-supervised settings.
- Score: 50.40482222266927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Segmentation (RES), which is aimed at localizing and
segmenting the target according to the given language expression, has drawn
increasing attention. Existing methods jointly model the localization and
segmentation steps, relying on fused visual and linguistic features for both.
We argue that the conflict between the goals of identifying an object and
generating a mask limits RES performance. To resolve this conflict, we propose
a parallel position-kernel-segmentation pipeline that first isolates the
localization and segmentation steps and then lets them interact. In our
pipeline, linguistic information does not directly contaminate the visual
features used for segmentation. Specifically, the localization step localizes
the target object
in the image based on the referring expression, and then the visual kernel
obtained from the localization step guides the segmentation step. This pipeline
also enables us to train RES in a weakly-supervised way, where the pixel-level
segmentation labels are replaced by click annotations on center and corner
points. The position head is trained with the click annotations as full
supervision, and the segmentation head is trained with weakly-supervised
segmentation losses. To validate our framework in the weakly-supervised
setting, we annotated three RES benchmark datasets (RefCOCO, RefCOCO+ and
RefCOCOg) with click annotations. Our method is simple but surprisingly
effective, outperforming all previous state-of-the-art RES methods in both
fully- and weakly-supervised settings by a large margin. The benchmark code
and datasets will be released.
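To make the pipeline concrete, here is a minimal PyTorch-style sketch of the parallel position-kernel-segmentation idea described above. All module and variable names (`PositionKernelSegPipeline`, `position_head`, `kernel_head`) are hypothetical illustrations, not the paper's actual architecture: localization runs on fused vision-language features, a dynamic kernel is sampled at the predicted position, and segmentation correlates that kernel with language-free visual features.

```python
import torch
import torch.nn as nn

class PositionKernelSegPipeline(nn.Module):
    """Hypothetical sketch: localization runs on fused vision-language
    features, while segmentation sees only language-free visual features,
    guided by a dynamic kernel sampled at the localized position."""

    def __init__(self, dim=256):
        super().__init__()
        self.position_head = nn.Conv2d(dim, 1, kernel_size=3, padding=1)
        self.kernel_head = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.seg_feats = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, visual_feats, fused_feats):
        # visual_feats: (B, C, H, W) pure visual features for segmentation
        # fused_feats:  (B, C, H, W) vision+language features for localization
        pos_map = self.position_head(fused_feats)          # (B, 1, H, W)
        prob = torch.softmax(pos_map.flatten(2), dim=-1)   # spatial distribution

        # Sample a dynamic kernel at the most likely target position.
        kernels = self.kernel_head(visual_feats)           # (B, C, H, W)
        b, c, h, w = kernels.shape
        idx = prob.argmax(dim=-1)                          # (B, 1)
        flat = kernels.flatten(2)                          # (B, C, H*W)
        kernel = flat.gather(2, idx.unsqueeze(1).expand(b, c, 1))  # (B, C, 1)

        # Segment by correlating the kernel with pure visual features:
        # effectively a per-image dynamic 1x1 convolution.
        feats = self.seg_feats(visual_feats)               # (B, C, H, W)
        mask_logits = torch.einsum('bcn,bchw->bnhw', kernel, feats)
        return pos_map, mask_logits
```

The separation is visible in the last step: `mask_logits` is computed only from visual features, and the expression influences the result solely through where the kernel is sampled, which is also what makes supervising the position head with clicks alone plausible.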
Related papers
- Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082] (2024-10-02)
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues for progressively localizing the target object.
Our method outperforms SOTA methods on three common benchmarks.
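A rough sketch of the progressive idea in this entry, assuming the expression is decomposed into several target-related cue embeddings that refine a localization heatmap stage by stage; the decomposition and all names here are illustrative, not PCNet's actual design.

```python
import torch
import torch.nn as nn

class ProgressiveLocalizer(nn.Module):
    """Illustrative only: refine a localization heatmap over several stages,
    conditioning each stage on one target-related textual cue."""

    def __init__(self, dim=256, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(dim + 1, 1, kernel_size=3, padding=1)
            for _ in range(num_stages)
        )

    def forward(self, visual_feats, cue_embeds):
        # visual_feats: (B, C, H, W); cue_embeds: list of (B, C) cue vectors,
        # e.g. one per attribute/relation phrase extracted from the expression
        b, c, h, w = visual_feats.shape
        heatmap = torch.zeros(b, 1, h, w, device=visual_feats.device)
        for stage, cue in zip(self.stages, cue_embeds):
            # Modulate visual features by the current cue, then refine the
            # heatmap conditioned on the previous stage's estimate.
            modulated = visual_feats * cue.view(b, c, 1, 1)
            heatmap = heatmap + stage(torch.cat([modulated, heatmap], dim=1))
        return heatmap
```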
- Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization [98.46318529630109] (2022-05-16)
We take inspiration from traditional spectral segmentation methods by reframing image decomposition as a graph partitioning problem.
We find that the eigenvectors of the Laplacian of a feature affinity matrix already decompose an image into meaningful segments, and can be readily used to localize objects in a scene.
By clustering the features associated with these segments across a dataset, we can obtain well-delineated, nameable regions.
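The spectral recipe in this entry is compact enough to illustrate directly: build an affinity graph over patch features, take eigenvectors of its normalized Laplacian, and read segments off the low-frequency eigenvectors. A minimal NumPy/SciPy sketch (the actual method additionally combines color affinities and clusters several eigenvectors):

```python
import numpy as np
from scipy.linalg import eigh

def spectral_segments(feats, num_eigvecs=4):
    """feats: (N, D) patch features for one image, N = H*W patches.
    Returns the first non-trivial eigenvectors of the normalized graph
    Laplacian; thresholding the Fiedler vector already gives a 2-way
    foreground/background partition."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = feats @ feats.T                  # cosine affinity between patches
    W = np.clip(W, 0, None)              # keep affinities non-negative
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    _, eigvecs = eigh(L)                 # eigenvalues in ascending order
    return eigvecs[:, 1:1 + num_eigvecs] # skip the trivial first eigenvector

# e.g. mask = spectral_segments(patch_feats)[:, 0] > 0
```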
- Weakly-supervised segmentation of referring expressions [81.73850439141374] (2022-05-10)
Text grounded semantic SEGmentation (TSEG) learns segmentation masks directly from image-level referring expressions, without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
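One plausible way to learn masks from image-level expressions, sketched under assumptions that go beyond this summary: score per-pixel features against every expression in a batch, pool the pixel scores to an image-level score, and train contrastively. TSEG's actual patch-text assignment mechanism is more elaborate; the function and loss below are hypothetical.

```python
import torch
import torch.nn.functional as F

def image_level_grounding_loss(pixel_feats, text_embeds, temperature=0.07):
    """Hypothetical weak-supervision objective: matching image-expression
    pairs should outscore mismatched ones after pooling pixel scores.

    pixel_feats: (B, C, H, W); text_embeds: (B, C), one expression per image.
    """
    b = pixel_feats.shape[0]
    pixels = F.normalize(pixel_feats.flatten(2), dim=1)   # (B, C, H*W)
    texts = F.normalize(text_embeds, dim=1)               # (B, C)
    sim = torch.einsum('bcn,kc->bkn', pixels, texts)      # (B_img, B_txt, H*W)
    image_scores = sim.max(dim=-1).values                 # pool over pixels
    targets = torch.arange(b, device=pixel_feats.device)
    return F.cross_entropy(image_scores / temperature, targets)
```

At test time, the per-pixel similarity map `sim` for the paired expression would serve directly as a soft segmentation mask.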
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853] (2021-03-30)
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
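A minimal sketch of an explicit locate-then-segment split, with illustrative module names rather than the paper's actual design: a language-conditioned prior map first localizes the referent, and the segmentation stage consumes that prior as an extra input channel.

```python
import torch
import torch.nn as nn

class LocateThenSegment(nn.Module):
    """Illustrative sketch: localize first, then segment with the
    localization prior as an extra cue."""

    def __init__(self, dim=256):
        super().__init__()
        self.locate = nn.Conv2d(dim, 1, kernel_size=1)    # position prior head
        self.segment = nn.Sequential(
            nn.Conv2d(dim + 1, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, 1, kernel_size=1),
        )

    def forward(self, fused_feats):
        # fused_feats: (B, C, H, W) vision-language features
        prior = torch.sigmoid(self.locate(fused_feats))   # step 1: locate
        mask = self.segment(torch.cat([fused_feats, prior], dim=1))  # step 2
        return prior, mask
```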
- SegGroup: Seg-Level Supervision for 3D Instance and Semantic Segmentation [88.22349093672975] (2020-12-18)
We design a weakly supervised point cloud segmentation algorithm that only requires clicking on one point per instance to indicate its location for annotation.
Using over-segmentation as pre-processing, we extend these location annotations into segments, which serve as seg-level labels.
We show that our seg-level supervised method (SegGroup) achieves comparable results with the fully annotated point-level supervised methods.
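The seg-level labeling step lends itself to a short sketch: an over-segmentation assigns every point a segment id, and one annotated click per instance labels the whole segment containing it. Function and argument names below are illustrative.

```python
import numpy as np

def clicks_to_seg_labels(segment_ids, click_indices, click_labels):
    """segment_ids:   (N,) over-segment id per point (from pre-processing)
    click_indices: (K,) point index of each annotated click
    click_labels:  (K,) label of each click (one click per instance)
    Returns (N,) labels where each clicked segment inherits its click's
    label and all other points stay -1 (unlabeled)."""
    labels = np.full(segment_ids.shape[0], -1, dtype=np.int64)
    for idx, lab in zip(click_indices, click_labels):
        labels[segment_ids == segment_ids[idx]] = lab  # spread click to segment
    return labels
```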
- SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor [51.298760338410624] (2020-01-24)
We propose a SceneEncoder module that imposes scene-aware guidance to enhance the effect of global information.
The module predicts a scene descriptor, which learns to represent the categories of objects present in the scene.
We also design a region similarity loss that propagates distinguishing features to neighboring points with the same label.
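A minimal sketch of the scene-descriptor idea under assumed shapes and names, not the paper's API: predict which categories exist in the scene from a globally pooled feature, then use that multi-label prediction to gate per-point class logits.

```python
import torch
import torch.nn as nn

class SceneDescriptorGate(nn.Module):
    """Illustrative sketch: a multi-label scene descriptor suppresses
    per-point logits of categories absent from the scene."""

    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.descriptor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, num_classes),
        )

    def forward(self, point_feats, point_logits):
        # point_feats: (B, N, C); point_logits: (B, N, num_classes)
        global_feat = point_feats.max(dim=1).values           # scene-level pooling
        scene = torch.sigmoid(self.descriptor(global_feat))   # (B, num_classes)
        # Suppress logits of categories the descriptor says are absent.
        return point_logits * scene.unsqueeze(1)
```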
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.