VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- URL: http://arxiv.org/abs/2505.20289v1
- Date: Mon, 26 May 2025 17:59:17 GMT
- Title: VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- Authors: Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee,
- Abstract summary: VisTA is a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance.<n>We show that VisTA achieves substantial performance gains over training-free baselines.<n>These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
- Score: 39.853940586221924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
Related papers
- ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents [16.06309106596998]
ToolTok is a novel paradigm of multi-step pathfinding for GUI agents.<n>We devise tools aligned with human interaction habits and represent each tool using learnable token embeddings.<n>We construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding.
arXiv Detail & Related papers (2026-01-30T08:38:05Z) - AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce textbfAdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior.<n>AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z) - Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs [76.47326680870783]
We introduce VISTA-Gym, a training environment for incentivizing tool-integrated visual reasoning capabilities in vision-language models (VLMs)<n> VISTA-Gym unifies diverse real-world multimodal reasoning tasks with a standardized interface for visual tools.<n>We show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72% in 11 public reasoning-intensive VQA benchmarks.
arXiv Detail & Related papers (2025-11-24T22:58:26Z) - OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs.<n>We propose a novel reinforcement learning framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools.<n>V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z) - ToolRL: Reward is All Tool Learning Needs [54.16305891389931]
Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities.<n>Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities.<n>We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
arXiv Detail & Related papers (2025-04-16T21:45:32Z) - ToolACE-R: Tool Learning with Adaptive Self-Refinement [84.69651852838794]
Tool learning allows Large Language Models to leverage external tools for solving complex user tasks.<n>We propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations.<n>Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes.
arXiv Detail & Related papers (2025-04-02T06:38:56Z) - From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [60.733557487886635]
This paper focuses on bridging the comprehension gap between Large Language Models and external tools.<n>We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation.<n>This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases.
arXiv Detail & Related papers (2024-10-10T17:58:44Z) - LLM With Tools: A Survey [0.0]
This paper delves into the methodology,challenges, and developments in the realm of teaching LLMs to use external tools.
We introduce a standardized paradigm for tool integration guided by a series of functions that map user instructions to actionable plans.
Our exploration reveals the various challenges encountered, such as tool invocation timing, selection accuracy, and the need for robust reasoning processes.
arXiv Detail & Related papers (2024-09-24T14:08:11Z) - Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [26.28459880766842]
We propose a decision-aware and generalizable tool-usage framework (DEER)
Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline.
Our proposed DEER is effective and significantly outperforms baselines across various datasets.
arXiv Detail & Related papers (2024-02-26T16:11:03Z) - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [69.59482029810198]
CLOVA is a Closed-Loop Visual Assistant that operates within a framework encompassing inference, reflection, and learning phases.
Results demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing.
arXiv Detail & Related papers (2023-12-18T03:34:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.