SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
- URL: http://arxiv.org/abs/2510.10160v1
- Date: Sat, 11 Oct 2025 10:50:58 GMT
- Title: SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
- Authors: Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
- Abstract summary: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. Recent methods predominantly focus on simple expressions like "red car" or "left girl".
- Score: 58.80001825332851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and larger training corpora to achieve impressive results, they predominantly focus on simple expressions: short, clear noun phrases like "red car" or "left girl". This simplification often reduces RIS to a keyword/concept matching problem, limiting the model's ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address these challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process: first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba's scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
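To make the scan-then-update idea in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of one saccade-fixation cycle. This is not the authors' implementation: the module names (`SelectiveScan`, `SaFiReBlock`, `reiterate`), the GRU-cell stand-in for a Mamba block, and the text-gating scheme are all assumptions made for illustration. Only the control flow mirrors the abstract: a global "saccade" scan over visual tokens, then a text-conditioned "fixation" refinement, repeated for a few cycles at linear cost per cycle.

```python
# Hypothetical sketch of SaFiRe's two-phase reiteration; NOT the authors' code.
# A real Mamba block would replace the sequential GRU recurrence below with a
# hardware-efficient parallel selective scan.

import torch
import torch.nn as nn


class SelectiveScan(nn.Module):
    """Stand-in for a Mamba block: a linear-time scan over a token sequence."""

    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)  # placeholder recurrence, O(N) in tokens

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        b, n, d = tokens.shape
        h = tokens.new_zeros(b, d)
        outs = []
        for i in range(n):  # scan-then-update: state h summarizes the prefix
            h = self.cell(tokens[:, i], h)
            outs.append(h)
        return torch.stack(outs, dim=1)


class SaFiReBlock(nn.Module):
    """One saccade-fixation cycle (hypothetical decomposition)."""

    def __init__(self, dim: int):
        super().__init__()
        self.saccade = SelectiveScan(dim)   # phase 1: global pass over image tokens
        self.fixation = SelectiveScan(dim)  # phase 2: detail pass, text-conditioned
        self.text_gate = nn.Linear(dim, dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, D) visual tokens; txt: (B, D) pooled expression embedding
        glob = self.saccade(vis)  # global understanding of the whole image
        gated = glob * torch.sigmoid(self.text_gate(txt)).unsqueeze(1)
        return vis + self.fixation(gated)  # detail-oriented refinement


def reiterate(vis, txt, block, cycles: int = 3):
    # Multi-cycle refinement; each cycle stays linear in the token count.
    for _ in range(cycles):
        vis = block(vis, txt)
    return vis
```

As a usage example, `reiterate(torch.randn(2, 196, 256), torch.randn(2, 256), SaFiReBlock(256))` refines 196 visual tokens over three cycles; the refined tokens would then feed a mask decoder in a full RIS pipeline.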
Related papers
- Understanding What Is Not Said: Referring Remote Sensing Image Segmentation with Scarce Expressions [45.04317112354794]
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL), for RRSIS. We show that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions.
arXiv Detail & Related papers (2025-10-26T17:18:48Z) - ConText: Driving In-context Learning for Text Removal and Segmentation [59.6299939669307]
This paper presents the first study on adapting the visual in-context learning paradigm to optical character recognition tasks. We propose a task-chaining compositor in the form of image-removal-segmentation. We also introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation.
arXiv Detail & Related papers (2025-06-04T10:06:32Z) - IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis [46.502962768034166]
Zero-shot Referring Image Segmentation identifies the instance mask that best aligns with a referring expression without training or fine-tuning. Previous CLIP-based models exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. IteRPrimE outperforms previous state-of-the-art zero-shot methods, particularly excelling in out-of-domain scenarios.
arXiv Detail & Related papers (2025-03-02T15:19:37Z) - Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach, which substantially improves the generalization ability by addressing two dilemmas of existing methods.
Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context.
arXiv Detail & Related papers (2023-12-01T09:31:24Z) - Referring Expression Comprehension Using Language Adaptive Inference [15.09309604460633]
This paper explores the adaptation between expressions and REC models for dynamic inference.
We propose a framework named Language Adaptive Subnets (LADS), which can extract language-adaptive subnets from the REC model conditioned on the referring expressions.
Experiments on RefCOCO, RefCOCO+, RefCOCOg, and ReferIt show that the proposed method achieves faster inference speed and higher accuracy than state-of-the-art approaches.
arXiv Detail & Related papers (2023-06-06T07:58:59Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments the object described by a language expression from an image.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645]
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
arXiv Detail & Related papers (2022-02-25T12:10:02Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net). Through the two-stage enhancement, the proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representation.
arXiv Detail & Related papers (2020-07-19T01:45:02Z) - Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z) - Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension [39.40351938417889]
Referring expression comprehension (REC) aims at identifying a particular object in a scene by a natural language expression.
Some popular referring expression datasets fail to provide an ideal test bed for evaluating the reasoning ability of the models.
We propose a new dataset for visual reasoning in the context of referring expression comprehension, with two main features.
arXiv Detail & Related papers (2020-03-01T04:59:38Z)