Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
- URL: http://arxiv.org/abs/2602.23898v1
- Date: Fri, 27 Feb 2026 10:47:26 GMT
- Title: Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
- Authors: Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu
- Abstract summary: Ref-Adv is a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding.
- Score: 65.37131487318273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
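The abstract describes the word-order perturbation ablation only at a high level. As a rough illustration of how such a check could be run against a REC model, consider the following sketch; the `model.ground` call, the box format, and the dataset layout are hypothetical placeholders, not the authors' code.

```python
import random

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def shuffle_words(expression, seed=0):
    """Destroy word order while keeping the bag of words intact."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def word_order_ablation(model, dataset, iou_threshold=0.5):
    """Compare grounding accuracy on original vs. word-shuffled expressions.

    A small gap between the two accuracies suggests the model relies on
    bag-of-words shortcuts rather than reading the expression
    compositionally. `model.ground(image, text) -> box` is assumed.
    """
    correct_orig = correct_shuf = 0
    for image, expression, gt_box in dataset:
        if iou(model.ground(image, expression), gt_box) >= iou_threshold:
            correct_orig += 1
        if iou(model.ground(image, shuffle_words(expression)), gt_box) >= iou_threshold:
            correct_shuf += 1
    n = len(dataset)
    return correct_orig / n, correct_shuf / n
```

A descriptor-deletion sufficiency check would follow the same pattern, dropping one descriptor at a time instead of shuffling and testing whether the target remains identifiable.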
Related papers
- MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions [42.03378622674476]
We propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations.
arXiv Detail & Related papers (2026-02-21T07:56:59Z)
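MIRROR's closed loop is described only abstractly above; a schematic of such a draft-critique-verify-revise cycle might look like the following, where every callable is a hypothetical stand-in for a model component, not the paper's API.

```python
def mirror_style_loop(question, image, draft_fn, critique_fn,
                      verify_region_fn, revise_fn, max_rounds=3):
    """Schematic closed loop: draft an answer, critique it, verify the
    critique against specific image regions, then revise.

    Each *_fn is a hypothetical stand-in; the real MIRROR components
    are not specified in the abstract.
    """
    answer = draft_fn(question, image)
    for _ in range(max_rounds):
        critique = critique_fn(question, image, answer)
        if critique is None:  # critic is satisfied, stop iterating
            break
        # Ground the critique in concrete image regions before trusting it.
        evidence = verify_region_fn(image, critique)
        answer = revise_fn(question, image, answer, critique, evidence)
    return answer
```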
- RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension [45.091078689395864]
Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. We introduce RefBench-PRO, a comprehensive REC benchmark, which decomposes referring expressions into two core dimensions, i.e., perception and reasoning. We also propose Ref-R1, an RL-based learning scheme, which incorporates Dynamic IoU-based GRPO to improve localization accuracy under increasingly complex reasoning conditions.
arXiv Detail & Related papers (2025-12-06T03:59:21Z)
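The abstract does not spell out the Dynamic IoU-based GRPO reward in Ref-R1. One plausible shape is an IoU-thresholded reward whose acceptance threshold tightens over training; the linear schedule below is purely an assumption for illustration.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.9):
    """Hypothetical reward for an IoU-thresholded GRPO objective.

    Early in training, coarse localizations earn reward; later, only
    tight boxes do. The actual schedule used by Ref-R1 is not given
    in the abstract.
    """
    t = step / max(1, total_steps)
    thresh = start_thresh + t * (end_thresh - start_thresh)
    return 1.0 if box_iou(pred_box, gt_box) >= thresh else 0.0
```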
- SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation [58.80001825332851]
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. Recent methods predominantly focus on simple expressions like "red car" or "left girl".
arXiv Detail & Related papers (2025-10-11T10:50:58Z)
- Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection [58.82268659497348]
We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. We propose Forensic-Chat, a generalizable, explainable, and still-conversational assistant for fake image detection.
arXiv Detail & Related papers (2025-09-29T20:59:19Z) - KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge [1.5833270109954136]
We propose KnowDR-REC, a benchmark built upon real-world knowledge that requires fine-grained multimodal reasoning across text and image. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks.
arXiv Detail & Related papers (2025-08-12T19:43:44Z) - Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation [61.37076111486196]
Ref-AVS aims to segment target objects in audible videos based on given reference expressions. We propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process. Ref-Thinker is a multimodal language model capable of reasoning over textual, visual, and auditory cues.
arXiv Detail & Related papers (2025-08-06T13:05:09Z)
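The Think-Ground-Segment decomposition in TGS-Agent suggests a simple three-stage pipeline; the sketch below is one way to wire it up, with all component calls as hypothetical stand-ins rather than the TGS-Agent interface.

```python
def think_ground_segment(expression, video, audio, thinker, grounder, segmenter):
    """Schematic Think-Ground-Segment pipeline.

    1. Think:   reason over textual, visual, and auditory cues to decide
                which object the expression refers to (e.g. Ref-Thinker).
    2. Ground:  localize that object in each video frame.
    3. Segment: produce a per-frame mask from each grounded box.

    All three callables are hypothetical stand-ins for model components.
    """
    target_description = thinker(expression, video, audio)
    boxes_per_frame = grounder(video, target_description)
    masks_per_frame = [segmenter(frame, box)
                       for frame, box in zip(video, boxes_per_frame)]
    return masks_per_frame
```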
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge. It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z)
- Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based models in reasoning. We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
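RIV-CoT interleaves crops of relevant entities into the reasoning chain. A minimal illustration of the cropping-and-interleaving step follows; the retrieval output format and the prompt wording are assumptions, not the paper's implementation.

```python
from PIL import Image

def build_interleaved_prompt(image_path, question, retrieved_entities):
    """Crop retrieved entity boxes and interleave them with text,
    RIV-CoT style.

    `retrieved_entities` is a hypothetical list of
    (name, (x1, y1, x2, y2)) pairs produced by an entity retriever.
    Returns a list of text segments and PIL crops that a multimodal
    model accepting interleaved inputs could consume.
    """
    image = Image.open(image_path)
    parts = [f"Question: {question}"]
    for name, box in retrieved_entities:
        parts.append(f"Relevant entity '{name}':")
        parts.append(image.crop(box))  # PIL crop takes an (x1, y1, x2, y2) box
    parts.append("Reason step by step using the crops above, then answer.")
    return parts
```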
- Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations [4.697267141773321]
Retrieval-augmented generation (RAG) has emerged as a critical mechanism in contemporary NLP to support Large Language Models (LLMs) in systematically accessing richer factual context.
Recent studies have shown that LLMs still struggle to critically analyse RAG-based in-context information, a limitation that may lead to incorrect inferences and hallucinations.
In this paper, we investigate how to elicit critical reasoning in RAG via contrastive explanations.
arXiv Detail & Related papers (2024-10-30T10:11:53Z)
- Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks.
They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions.
We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z)
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG).
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
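Self-RAG's abstract describes retrieval plus self-reflection at a high level; a toy control loop in that spirit could look like the sketch below, where the retriever, generator, and critic are hypothetical callables and the external critic is a simplification of the paper's learned reflection tokens.

```python
def self_rag_style_generate(query, retriever, generator, critic, top_k=3):
    """Toy retrieve-generate-critique loop in the spirit of Self-RAG.

    The real method trains the LM to emit reflection tokens that decide
    when to retrieve and how to score its own output; here `critic` is
    an external scoring function, a simplification of that mechanism.
    """
    passages = retriever(query, top_k=top_k)  # candidate evidence
    candidates = []
    for passage in passages:
        answer = generator(query, passage)
        # Score how well the passage supports the answer and how useful
        # the answer is (a stand-in for support/utility reflection tokens).
        candidates.append((critic(query, passage, answer), answer))
    best_score, best_answer = max(candidates, key=lambda c: c[0])
    return best_answer
```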
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.