Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
- URL: http://arxiv.org/abs/2504.01954v1
- Date: Wed, 02 Apr 2025 17:58:05 GMT
- Title: Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
- Authors: Jing Liu, Wenxuan Wang, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang,
- Abstract summary: Referring expression segmentation (RES) aims to segment the masks of the entities that match a descriptive language expression. Traditional RES methods primarily address object-level grounding, but real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity. We propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks.
- Score: 36.506512800685066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring expression segmentation (RES) aims to segment the masks of the entities that match a descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single-object, or part-level references. This introduces significant challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the data resources and unified frameworks needed for the more practical multi-grained RES. In this paper, we take a step further towards a granularity-unified RES task. To overcome data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With a joint model architecture and shared parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, and RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset, and UniRES++ model will be publicly available at https://github.com/Rubics-Xuan/MRES.
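To make the task contract above concrete, here is a minimal, purely illustrative Python sketch of what a granularity-unified RES interface could look like. `UniResModel`, `MresQuery`, and `segment` are hypothetical stand-ins rather than the released UniRES++ API; only the input/output convention (an RGB image plus a referring expression in, a binary mask out) reflects the task as described in the abstract.

```python
# A minimal sketch of the multi-granularity RES (MRES) task contract.
# NOTE: UniResModel / MresQuery are illustrative placeholders, not the real API.
from dataclasses import dataclass
import numpy as np

@dataclass
class MresQuery:
    expression: str   # natural-language reference to the target(s)
    granularity: str  # "multi-object" | "object" | "part"

class UniResModel:
    """Placeholder for a unified RES model; the interface is assumed."""
    def segment(self, image: np.ndarray, query: MresQuery) -> np.ndarray:
        # A real model would ground query.expression in the image;
        # here we just return an all-false binary mask of the same spatial size.
        return np.zeros(image.shape[:2], dtype=bool)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a loaded RGB image
model = UniResModel()

# One interface serves every granularity the paper targets:
for q in [
    MresQuery("all the people on the left", "multi-object"),  # generalized RES
    MresQuery("the woman in the red coat", "object"),         # classic RES
    MresQuery("the left hand of the woman in red", "part"),   # part-level MRES
]:
    mask = model.segment(image, q)
    print(q.granularity, mask.shape, int(mask.sum()))
```

The point of the sketch is that a single model and prompt format cover multi-object, object, and part references, which is what distinguishes the MRES setting from classic object-level RES.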
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities (a rough illustrative sketch follows this entry).
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
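As rough intuition for what compressing visual tokens "at different granularities" can mean, here is a hedged numpy sketch of Matryoshka-style pooling, in which one patch-token grid yields nested coarse-to-fine token sets. The function, grid sizes, and dimensions are assumptions for illustration; the actual MME architecture is not reproduced here.

```python
# Illustrative Matryoshka-style compression of visual tokens: the same token
# grid is average-pooled to several nested granularities. Not the MME code.
import numpy as np

def pool_tokens(tokens: np.ndarray, grid: int, target: int) -> np.ndarray:
    """Average-pool a (grid*grid, dim) token map down to (target*target, dim)."""
    dim = tokens.shape[-1]
    s = grid // target  # assumes grid is divisible by target
    pooled = tokens.reshape(target, s, target, s, dim).mean(axis=(1, 3))
    return pooled.reshape(target * target, dim)

tokens = np.random.randn(24 * 24, 768)  # e.g. a 24x24 patch grid from a ViT
for g in (24, 12, 6, 3):                # nested granularities, fine to coarse
    print(f"{g * g:4d} tokens ->", pool_tokens(tokens, 24, g).shape)
```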
- CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple-target and non-target scenarios.
Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification.
We propose a Counting-Aware Hierarchical Decoding framework (CoHD) for GRES.
arXiv Detail & Related papers (2024-05-24T15:53:59Z)
- Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation
Generalized Referring Expression Segmentation (GRES) introduces new challenges by allowing expressions to describe multiple objects or to lack a specific object reference.
Existing RES methods usually rely on sophisticated encoder-decoder architectures and feature fusion modules.
We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces [92.52589788633856]
We propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture.
With the unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run-time.
Our proposed UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network.
arXiv Detail & Related papers (2023-12-25T12:54:11Z)
- GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or to identify empty targets absent from the image.
Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a single prompt.
We propose Generalized Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z)
- Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation [38.0788558329856]
We build the largest visual grounding dataset, namely MRES-32M, which comprises over 32.2M high-quality masks and captions.
In addition, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task.
arXiv Detail & Related papers (2023-12-13T09:29:45Z)
- GRES: Generalized Referring Expression Segmentation [32.12725360752345]
We introduce a new benchmark called Generalized Referring Expression Segmentation (GRES).
GRES allows expressions to refer to an arbitrary number of target objects.
We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions (a hypothetical record sketch follows this entry).
arXiv Detail & Related papers (2023-06-01T17:57:32Z)
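To ground the three expression types just mentioned, here is a small hypothetical Python sketch of what gRefCOCO-style records could look like. The field names (`expression`, `target_instance_ids`) are assumptions for exposition, not the dataset's actual schema.

```python
# Hypothetical gRefCOCO-style records covering the three expression types.
# Field names are assumed for illustration; the real annotation schema may differ.
records = [
    {"expression": "the two kids playing frisbee", "target_instance_ids": [3, 7]},  # multi-target
    {"expression": "the kid wearing a blue cap", "target_instance_ids": [3]},       # single-target
    {"expression": "the dog chasing the frisbee", "target_instance_ids": []},       # no-target
]

for r in records:
    n = len(r["target_instance_ids"])
    kind = "no-target (referent absent)" if n == 0 else f"{n} target(s)"
    print(r["expression"], "->", kind)
```

The no-target case is what distinguishes generalized RES from the classic setting: the model must recognize that the described referent does not appear in the image and return an empty mask.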