Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
- URL: http://arxiv.org/abs/2511.19516v2
- Date: Wed, 26 Nov 2025 15:38:35 GMT
- Title: Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
- Authors: Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian,
- Abstract summary: GroundingAgent is a visual grounding framework that operates without task-specific fine-tuning.<n>It achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks.<n>It also offers strong interpretability, transparently illustrating each reasoning step.
- Score: 63.109585527799005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Temporal Video Grounding aims to retrieve thetemporal tube of a target object or person in a video given a text query.<n>We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario.<n>Specifically, two specialized agents SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent) constructed leveraging on modern Multimoal Large Language Models (MLLMs)<n>Experiments on popular benchmarks demonstrate the superiority of the proposed approach where it outperforms existing weakly-supervised and zero-shot approaches by a margin
arXiv Detail & Related papers (2026-02-10T10:16:27Z) - SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning [31.665287327579026]
SpotAgent is a framework that formalizes geo-localization into an agentic reasoning process.<n>It actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram.<n>It achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
arXiv Detail & Related papers (2026-02-10T06:57:12Z) - Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object (RVOS) aims to segment objects in videos based on textual queries.<n>Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z) - RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations [52.752467948588816]
We propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations.<n> RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks.<n>Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance
arXiv Detail & Related papers (2025-12-30T06:50:11Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
textscFineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects.<n>We present textscFineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.<n>Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.<n>MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z) - RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow [19.502882116487005]
Remote sensing imagery presents vast, inherently unstructured spatial data.<n>We propose RemoteReasoner, a unified workflow for geospatial reasoning.<n>RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks.
arXiv Detail & Related papers (2025-07-25T13:58:11Z) - AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.<n>AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z) - From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language.<n>This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs)<n>We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z) - Global and Local Semantic Completion Learning for Vision-Language
Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.