Related papers: Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

URL: http://arxiv.org/abs/2511.19516v2
Date: Wed, 26 Nov 2025 15:38:35 GMT
Title: Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
Authors: Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian,
Abstract summary: GroundingAgent is a visual grounding framework that operates without task-specific fine-tuning.<n>It achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks.<n>It also offers strong interpretability, transparently illustrating each reasoning step.
Score: 63.109585527799005
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

Related papers

Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Temporal Video Grounding aims to retrieve thetemporal tube of a target object or person in a video given a text query.<n>We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario.<n>Specifically, two specialized agents SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent) constructed leveraging on modern Multimoal Large Language Models (MLLMs)<n>Experiments on popular benchmarks demonstrate the superiority of the proposed approach where it outperforms existing weakly-supervised and zero-shot approaches by a margin
arXiv Detail & Related papers (2026-02-10T10:16:27Z)
SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning [31.665287327579026]
SpotAgent is a framework that formalizes geo-localization into an agentic reasoning process.<n>It actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram.<n>It achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
arXiv Detail & Related papers (2026-02-10T06:57:12Z)
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object (RVOS) aims to segment objects in videos based on textual queries.<n>Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations [52.752467948588816]
We propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations.<n> RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks.<n>Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance
arXiv Detail & Related papers (2025-12-30T06:50:11Z)
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
textscFineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects.<n>We present textscFineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.<n>Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.<n>MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z)
RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow [19.502882116487005]
Remote sensing imagery presents vast, inherently unstructured spatial data.<n>We propose RemoteReasoner, a unified workflow for geospatial reasoning.<n>RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks.
arXiv Detail & Related papers (2025-07-25T13:58:11Z)
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.<n>AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language.<n>This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs)<n>We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models. We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.