HDC: Hierarchical Semantic Decoding with Counting Assistance for Generalized Referring Expression Segmentation
- URL: http://arxiv.org/abs/2405.15658v1
- Date: Fri, 24 May 2024 15:53:59 GMT
- Title: HDC: Hierarchical Semantic Decoding with Counting Assistance for Generalized Referring Expression Segmentation
- Authors: Zhuoyan Luo, Yinghao Wu, Yong Liu, Yicheng Xiao, Xiao-Ping Zhang, Yujiu Yang
- Abstract summary: Generalized Referring Expression Segmentation (GRES) extends the formulation of classic RES by involving multiple/non-target scenarios.
We propose a $\textbf{H}$ierarchical Semantic $\textbf{D}$ecoding with $\textbf{C}$ounting Assistance framework (HDC).
We endow HDC with explicit counting capability to facilitate comprehensive object perception in multiple/single/non-target settings.
- Score: 33.40691116355158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The newly proposed Generalized Referring Expression Segmentation (GRES) extends the formulation of classic RES by involving multiple/non-target scenarios. Recent approaches focus on optimizing the last modality-fused feature, which is directly utilized for segmentation and object-existence identification. However, the attempt to integrate all-grained information into a single joint representation is impractical in GRES due to the increased complexity of the spatial relationships among instances and deceptive text descriptions. Furthermore, the subsequent binary target justification across all referent scenarios fails to specify their inherent differences, leading to ambiguity in object understanding. To address these weaknesses, we propose a $\textbf{H}$ierarchical Semantic $\textbf{D}$ecoding with $\textbf{C}$ounting Assistance framework (HDC). It hierarchically transfers complementary modality information across granularities, and then aggregates each well-aligned semantic correspondence for multi-level decoding. Moreover, with complete semantic context modeling, we endow HDC with explicit counting capability to facilitate comprehensive object perception in multiple/single/non-target settings. Experimental results on the gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of HDC, which outperforms state-of-the-art GRES methods by a remarkable margin. Code will be available $\href{https://github.com/RobertLuo1/HDC}{here}$.
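To make the decoding and counting ideas concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract describes: language features are aligned with each visual granularity separately (rather than fused into one joint representation), the per-level semantics are aggregated for decoding, and an explicit counting head predicts the number of referents, with a zero class covering the non-target case. Every module name, dimension, and design choice here is an illustrative assumption, not the authors' architecture; their actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn


class HierarchicalDecoderSketch(nn.Module):
    """Toy sketch of hierarchical language-vision decoding with a counting
    branch. Module choices and sizes are illustrative assumptions only."""

    def __init__(self, dim: int = 256, num_levels: int = 3, max_count: int = 10):
        super().__init__()
        # One cross-attention block per visual granularity (e.g. strides
        # 32/16/8), so each level keeps its own language-vision correspondence
        # instead of collapsing everything into a single joint feature.
        self.cross_attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_levels)
        ])
        # Aggregate the aligned per-level semantics for multi-level decoding.
        self.aggregate = nn.Linear(num_levels * dim, dim)
        self.mask_head = nn.Linear(dim, dim)             # mask embedding
        self.count_head = nn.Linear(dim, max_count + 1)  # class 0 = non-target

    def forward(self, visual_levels: list, text_feats: torch.Tensor):
        # visual_levels: list of (B, N_l, dim) tensors, coarse to fine.
        # text_feats:    (B, T, dim) language features.
        aligned = []
        for attn, vis in zip(self.cross_attn, visual_levels):
            out, _ = attn(text_feats, vis, vis)  # language attends to one level
            aligned.append(out.mean(dim=1))      # pool tokens -> (B, dim)
        fused = self.aggregate(torch.cat(aligned, dim=-1))
        return self.mask_head(fused), self.count_head(fused)


# Dummy usage: two images, three feature levels, a 12-token expression.
model = HierarchicalDecoderSketch()
vis = [torch.randn(2, n, 256) for n in (49, 196, 784)]
txt = torch.randn(2, 12, 256)
mask_embed, count_logits = model(vis, txt)
print(mask_embed.shape, count_logits.shape)  # (2, 256) and (2, 11)
```

Note the design point the abstract motivates: predicting a count rather than a binary object-existence flag lets the model distinguish multiple-, single-, and non-target expressions explicitly instead of lumping all referent scenarios into one yes/no decision.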
Related papers
- Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture.
It integrates implicit and explicit modeling approaches within a two-stage framework.
It outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
- Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension [46.07415235144545]
We address the challenging task of Generalized Referring Expression Comprehension (GREC).
Existing REC methods face challenges in handling the complex cases encountered in GREC.
We propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G).
arXiv Detail & Related papers (2025-01-02T18:57:59Z)
- Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation [69.01029651113386]
Embodied-RAG is a framework that enhances the model of an embodied agent with a non-parametric memory system.
At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail.
We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 250 explanation and navigation queries.
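As a rough illustration of that memory layout, the sketch below builds a tiny "semantic forest" in which each node stores a language description and coarser nodes summarize their children; retrieval descends greedily from coarse to fine. The node schema, the word-overlap scorer, and the example descriptions are all invented stand-ins (a real system would use embedding similarity and generated summaries), not Embodied-RAG's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNode:
    """One node of a toy 'semantic forest': a language description at some
    level of detail, with finer-grained descriptions as children."""
    description: str
    children: list = field(default_factory=list)


def retrieve(node: SemanticNode, query: str) -> SemanticNode:
    # Greedy coarse-to-fine descent: at each level, follow the child whose
    # description shares the most words with the query. Word overlap is a
    # stand-in for the embedding similarity a real system would use.
    query_words = set(query.lower().split())
    while node.children:
        node = max(
            node.children,
            key=lambda c: len(query_words & set(c.description.lower().split())),
        )
    return node


# Tiny two-level forest for one floor of a building.
root = SemanticNode("floor 2 of the office building", children=[
    SemanticNode("kitchen area with a coffee machine and a sink"),
    SemanticNode("open office area with desks and monitors"),
])
print(retrieve(root, "find the coffee machine").description)
# -> "kitchen area with a coffee machine and a sink"
```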
arXiv Detail & Related papers (2024-09-26T21:44:11Z)
- Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression Segmentation introduces new challenges by allowing expressions to describe multiple objects or to lack a specific object reference.
Existing RES methods usually rely on sophisticated encoder-decoder architectures and feature fusion modules.
We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-step strategy: distributing language features, spatial semantic recurrent co-parsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or to identify empty targets absent from the image.
Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt.
We propose the Generalized Segmentation Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z)
- GRES: Generalized Referring Expression Segmentation [32.12725360752345]
We introduce a new benchmark called Generalized Referring Expression Segmentation (GRES).
GRES allows expressions to refer to an arbitrary number of target objects.
We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions.
arXiv Detail & Related papers (2023-06-01T17:57:32Z)
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an object from an image given a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation [63.910211095033596]
Few-shot segmentation aims to segment unseen-class objects given only a handful of densely labeled samples.
We propose a simple yet versatile framework in the spirit of divide-and-conquer.
Our proposed approach, named divide-and-conquer proxies (DCP), derives appropriate and reliable guidance from the few labeled support samples.
arXiv Detail & Related papers (2022-04-21T06:21:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.