GRES: Generalized Referring Expression Segmentation
- URL: http://arxiv.org/abs/2306.00968v1
- Date: Thu, 1 Jun 2023 17:57:32 GMT
- Title: GRES: Generalized Referring Expression Segmentation
- Authors: Chang Liu, Henghui Ding, Xudong Jiang
- Abstract summary: We introduce a new benchmark called Generalized Referring Expression Segmentation (GRES).
GRES allows expressions to refer to an arbitrary number of target objects.
We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions.
- Score: 32.12725360752345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES
datasets and methods commonly support single-target expressions only, i.e., one
expression refers to one target object. Multi-target and no-target expressions
are not considered. This limits the usage of RES in practice. In this paper, we
introduce a new benchmark called Generalized Referring Expression Segmentation
(GRES), which extends the classic RES to allow expressions to refer to an
arbitrary number of target objects. Towards this, we construct the first
large-scale GRES dataset called gRefCOCO that contains multi-target, no-target,
and single-target expressions. GRES and gRefCOCO are designed to be
well-compatible with RES, facilitating extensive experiments to study the
performance gap of the existing RES methods on the GRES task. In the
experimental study, we find that one of the big challenges of GRES is complex
relationship modeling. Based on this, we propose a region-based GRES baseline
ReLA that adaptively divides the image into regions with sub-instance clues,
and explicitly models the region-region and region-language dependencies. The
proposed approach ReLA achieves new state-of-the-art performance on both the
newly proposed GRES and classic RES tasks. The proposed gRefCOCO dataset and
method are available at https://henghuiding.github.io/GRES.
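The three expression types named in the abstract (multi-target, no-target, single-target) can be illustrated with a minimal sketch. It assumes each target object has its own binary instance mask and that the expression-level ground truth is the union of those masks, with an all-zero mask for a no-target expression. The function names and the list-of-lists mask layout are hypothetical, chosen for illustration only; this is not the gRefCOCO annotation format or the authors' code.

```python
from typing import List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 (illustrative layout)


def merge_target_masks(instance_masks: List[Mask], height: int, width: int) -> Mask:
    """Return the union of per-instance masks; all zeros if there are no targets."""
    merged = [[0] * width for _ in range(height)]
    for mask in instance_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    merged[y][x] = 1
    return merged


def is_no_target(mask: Mask) -> bool:
    """True when the expression matches nothing in the image (empty mask)."""
    return not any(any(row) for row in mask)


# Hypothetical 2x3 image with two annotated objects.
left_obj = [[1, 0, 0],
            [1, 0, 0]]
right_obj = [[0, 0, 1],
             [0, 0, 0]]

single = merge_target_masks([left_obj], 2, 3)            # single-target expression
multi = merge_target_masks([left_obj, right_obj], 2, 3)  # multi-target expression
empty = merge_target_masks([], 2, 3)                     # no-target expression
```

Under this assumption, a classic single-target RES sample is simply the special case of a one-element list of instance masks.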
Related papers
- CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation [37.96005100341482]
Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios.
Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification.
We propose a Counting-Aware Hierarchical Decoding framework (CoHD) for GRES.
arXiv Detail & Related papers (2024-05-24T15:53:59Z) - Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression Segmentation (GRES) introduces new challenges by allowing expressions to describe multiple objects or lack specific object references.
Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules.
We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z) - GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image.
Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt.
We propose Generalized Segmentation Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z) - Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation [38.0788558329856]
We build the largest visual grounding dataset, namely MRES-32M, which comprises over 32.2M high-quality masks and captions.
Besides, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task.
arXiv Detail & Related papers (2023-12-13T09:29:45Z) - GREC: Generalized Referring Expression Comprehension [52.83101289813662]
This study introduces a new benchmark termed Generalized Referring Expression Comprehension (GREC).
This benchmark extends the classic REC by permitting expressions to describe any number of target objects.
To achieve this goal, we have built the first large-scale GREC dataset named gRefCOCO.
arXiv Detail & Related papers (2023-08-30T17:58:50Z) - Advancing Referring Expression Segmentation Beyond Single Image [12.234097959235417]
We propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES).
GRES expands to a collection of related images, allowing the described objects to be present in a subset of input images.
We introduce an elaborately compiled dataset named Grouped Referring Dataset (GRD), containing complete group-wise annotations of target objects described by given expressions.
arXiv Detail & Related papers (2023-05-21T13:14:28Z) - CLIP the Gap: A Single Domain Generalization Approach for Object Detection [60.20931827772482]
Single Domain Generalization tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain.
We propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts.
We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss.
arXiv Detail & Related papers (2023-01-13T12:01:18Z) - Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression.
We propose a parallel position-kernel-segmentation pipeline to better isolate the localization and segmentation steps and then let them interact.
Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods on fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z) - Learning Non-target Knowledge for Few-shot Semantic Segmentation [160.69431034807437]
We propose a novel framework, namely the Non-Target Region Eliminating (NTRE) network, to explicitly mine and eliminate background (BG) and distracting object (DO) regions in the query.
A BG Mining Module (BGMM) is proposed to extract the BG region via learning a general BG prototype.
A BG Eliminating Module and a DO Eliminating Module are proposed to successively filter out the BG and DO information from the query feature.
arXiv Detail & Related papers (2022-05-10T13:52:48Z) - Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred to by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)