Related papers: RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

URL: http://arxiv.org/abs/2505.02867v1
Date: Sat, 03 May 2025 15:19:20 GMT
Title: RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Authors: Ruiqi Wang, Hao Zhang
Abstract summary: We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES). Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is attribute prompting. We contribute a new benchmark dataset of ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.
Score: 11.205928115216
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting input expressions that are more general than what prior works were designed to handle. Specifically, our inputs encompass both object- and part-level labels as well as implicit references pointing to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is attribute prompting. We generate detailed descriptions of object/part attributes including shape, color, and location for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundational image segmentation model. Our approach encourages deep reasoning about object or part attributes related to function, style, design, etc., enabling the system to handle implicit queries without any part annotations for training or fine-tuning. As the first zero-shot and LLM-based RES method, RESAnything achieves clearly superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. Finally, we contribute a new benchmark dataset to offer ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.
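Illustrative sketch (not the authors' implementation): the abstract describes attaching attribute descriptions to segment proposals and using them to resolve a referring expression. The toy Python below shows only the final matching step under loud assumptions: the `Proposal` class, the `select_proposal` function, and the token-overlap score are all hypothetical stand-ins. In the paper, the descriptions and the comparison come from prompting an LLM, and the proposals come from a foundational segmentation model.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A candidate segment plus a text description of its attributes."""
    mask_id: int
    attributes: str  # hypothetical: stands in for LLM-generated attribute text


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return {w.strip(".,;:").lower() for w in text.split() if w.strip(".,;:")}


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity; a toy stand-in for LLM-based comparison."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_proposal(query_attributes: str, proposals: list[Proposal]) -> Proposal:
    """Return the proposal whose attribute text best overlaps the query's."""
    if not proposals:
        raise ValueError("no proposals to choose from")
    q = tokenize(query_attributes)
    return max(proposals, key=lambda p: jaccard(q, tokenize(p.attributes)))


# Usage: an implicit, function-based query resolved against two part proposals.
proposals = [
    Proposal(0, "round black rubber wheel at the bottom"),
    Proposal(1, "curved metal handle on the top, used for gripping"),
]
best = select_proposal("the part used for gripping, a handle on top", proposals)
print(best.mask_id)
```

Word overlap is far weaker than the reasoning the abstract describes; it is used here only so the example runs without any model or external service.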

Related papers

LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [63.07563443280147]
We propose a novel framework named LATex for aerial-ground person re-identification (AG-ReID). It adopts prompt-tuning strategies to leverage attribute-based text knowledge and thereby improve AG-ReID.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding [10.04904999444546]
Referring expression comprehension (REC) aims at achieving object localization based on natural language descriptions. Existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions. We propose Multi-ref EC, a novel framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects.
arXiv Detail & Related papers (2025-03-25T00:59:58Z)
One-shot In-context Part Segmentation [97.77292483684877]
We present the One-shot In-context Part (OIParts) framework to tackle the challenges of part segmentation. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient. We have achieved remarkable segmentation performance across diverse object categories.
arXiv Detail & Related papers (2025-03-03T03:50:54Z)
Instance-Aware Generalized Referring Expression Segmentation [32.96760407482406]
InstAlign is a method that incorporates object-level reasoning into the segmentation process. Our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
arXiv Detail & Related papers (2024-11-22T17:28:43Z)
Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression introduces new challenges by allowing expressions to describe multiple objects or lack specific object references. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules. We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z)
RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner [16.280644319404946]
Referring expression segmentation (RES) is a task that involves localizing specific instance-level objects based on free-form linguistic descriptions. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation.
arXiv Detail & Related papers (2024-02-08T11:40:50Z)
Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object). We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
GRES: Generalized Referring Expression Segmentation [32.12725360752345]
We introduce a new benchmark called Generalized Referring Expression (GRES). GRES allows expressions to refer to an arbitrary number of target objects. We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions.
arXiv Detail & Related papers (2023-06-01T17:57:32Z)
Reflection Invariance Learning for Few-shot Semantic Segmentation [53.20466630330429]
Few-shot semantic segmentation (FSS) aims to segment objects of unseen classes in query images with only a few annotated support images. This paper proposes a fresh few-shot segmentation framework to mine the reflection invariance in a multi-view matching manner. Experiments on both PASCAL-$5^i$ and COCO-$20^i$ datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-01T15:14:58Z)
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image from a language expression. We develop an algorithm that shifts from being localization-centric to segmentation-centric. Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.