GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
- URL: http://arxiv.org/abs/2601.05244v1
- Date: Thu, 08 Jan 2026 18:59:30 GMT
- Title: GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
- Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang
- Abstract summary: This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG). GREx extends the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions.
- Score: 99.51887959226735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.
Related papers
- SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images [49.52402091341301]
Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios. We present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS.
arXiv Detail & Related papers (2025-12-23T03:10:17Z) - ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval [125.19156877994612]
Generative retrieval (GR) reformulates information retrieval (IR) by framing it as the generation of document identifiers (docids). We propose ZeroGR, a zero-shot generative retrieval framework that leverages natural language instructions to extend GR across a wide range of IR tasks. Specifically, ZeroGR is composed of three key components: (i) an LM-based docid generator that unifies heterogeneous documents into semantically meaningful docids; (ii) an instruction-tuned query generator that generates diverse types of queries from natural language task descriptions to enhance
arXiv Detail & Related papers (2025-10-12T03:04:24Z) - CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation [37.96005100341482]
Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios.
Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification.
We propose a Counting-Aware Hierarchical Decoding framework (CoHD) for GRES.
arXiv Detail & Related papers (2024-05-24T15:53:59Z) - Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression Segmentation introduces new challenges by allowing expressions to describe multiple objects or lack specific object references. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules. We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z) - GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image.
Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt.
We propose Generalized Segmentation Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z) - GREC: Generalized Referring Expression Comprehension [52.83101289813662]
This study introduces a new benchmark termed Generalized Referring Expression Comprehension (GREC).
This benchmark extends the classic REC by permitting expressions to describe any number of target objects.
To achieve this goal, we have built the first large-scale GREC dataset named gRefCOCO.
arXiv Detail & Related papers (2023-08-30T17:58:50Z) - GRES: Generalized Referring Expression Segmentation [32.12725360752345]
We introduce a new benchmark called Generalized Referring Expression Segmentation (GRES).
GRES allows expressions to refer to an arbitrary number of target objects.
We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions.
arXiv Detail & Related papers (2023-06-01T17:57:32Z) - Advancing Referring Expression Segmentation Beyond Single Image [12.234097959235417]
We propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES).
GRES expands to a collection of related images, allowing the described objects to be present in a subset of input images.
We introduce an elaborately compiled dataset named Grouped Referring Dataset (GRD), containing complete group-wise annotations of target objects described by given expressions.
arXiv Detail & Related papers (2023-05-21T13:14:28Z) - Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)