Related papers: Instance-Aware Generalized Referring Expression Segmentation

Instance-Aware Generalized Referring Expression Segmentation

URL: http://arxiv.org/abs/2411.15087v1
Date: Fri, 22 Nov 2024 17:28:43 GMT
Title: Instance-Aware Generalized Referring Expression Segmentation
Authors: E-Ro Nguyen, Hieu Le, Dimitris Samaras, Michael Ryoo,
Abstract summary: InstAlign is a method that incorporates object-level reasoning into the segmentation process. Our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
Score: 32.96760407482406
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Recent works on Generalized Referring Expression Segmentation (GRES) struggle with handling complex expressions referring to multiple distinct objects. This is because these methods typically employ an end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate different object instances to the text query. To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.

Related papers

SimToken: A Simple Baseline for Referring Audio-Visual Segmentation [29.88252418748085]
Referring Audio-Visual (Ref-AVS) aims to segment specific objects in videos based on natural language expressions.<n>This task poses significant challenges in cross-modal reasoning and fine-grained object localization.<n>We propose a framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM)
arXiv Detail & Related papers (2025-09-22T08:55:04Z)
Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem.<n>Key innovation is semantic descriptors, a new textual representation of segmentation masks.<n>Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z)
Latent Expression Generation for Referring Image Segmentation and Grounding [13.611995923070426]
Most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain.<n>This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects.<n>We propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input.
arXiv Detail & Related papers (2025-08-07T07:57:27Z)
Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy. We first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z)
Contrastive Grouping with Transformer for Referring Image Segmentation [23.276636282894582]
We propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer) CGFormer explicitly captures object-level information via token-based querying and grouping strategy. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
arXiv Detail & Related papers (2023-09-02T20:53:42Z)
Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression. We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches. In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target. Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
Synthetic Instance Segmentation from Semantic Image Segmentation Masks [15.477053085267404]
We propose a novel paradigm called Synthetic Instance (SISeg) SISeg instance segmentation results by leveraging image masks generated by existing semantic segmentation models. In other words, the proposed model does not need extra manpower or higher computational expenses.
arXiv Detail & Related papers (2023-08-02T05:13:02Z)
Hierarchical Open-vocabulary Universal Image Segmentation [48.008887320870244]
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff" Our resulting model, named HIPIE tackles, HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework.
arXiv Detail & Related papers (2023-07-03T06:02:15Z)
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-textbfView textbfConsistent learning (ViewCo) for text-supervised semantic segmentation. We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them. Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint. Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations. Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
Instance-Specific Feature Propagation for Referring Segmentation [28.58551450280675]
Referring segmentation aims to generate a segmentation mask for the target instance indicated by a natural language expression. We propose a novel framework that simultaneously detects the target-of-interest via feature propagation and generates a fine-grained segmentation mask.
arXiv Detail & Related papers (2022-04-26T07:08:14Z)
Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression. Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask. We present a "Then-Then-Segment" scheme to tackle these problems. Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities. We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine objects detail along their boundaries by multi-scale feature fusion. In this paper, a new paradigm for semantic segmentation is proposed. Our insight is that appealing performance of semantic segmentation requires textitexplicitly modeling the object textitbody and textitedge, which correspond to the high and low frequency of the image. We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.