Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
- URL: http://arxiv.org/abs/2511.19773v1
- Date: Mon, 24 Nov 2025 22:58:26 GMT
- Title: Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
- Authors: Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Hanrui Wang, Di Jin, Wenqi Shi, Xuan Wang
- Abstract summary: We introduce VISTA-Gym, a training environment for incentivizing tool-integrated visual reasoning capabilities in vision-language models (VLMs). VISTA-Gym unifies diverse real-world multimodal reasoning tasks with a standardized interface for visual tools. We show that VISTA-R1-8B outperforms state-of-the-art baselines of similar size by 9.51%-18.72% on 11 public reasoning-intensive VQA benchmarks.
- Score: 76.47326680870783
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines of similar size by 9.51%-18.72%, demonstrating that VISTA-Gym is an effective training ground for unlocking tool-integrated reasoning capabilities in VLMs.
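The abstract describes VISTA-Gym as providing a standardized visual-tool interface, executable interaction loops, verifiable feedback signals, and trajectory logging, but this page does not include the actual API. The sketch below is a minimal, hypothetical illustration of such a multi-turn tool-integrated rollout; every name in it (Step, Trajectory, run_episode, policy.act, answer_checker) is assumed for illustration and is not taken from the paper.

```python
# Illustrative sketch only -- not the actual VISTA-Gym API.
# It mimics the pieces the abstract names: a standardized visual-tool interface,
# an executable interaction loop, verifiable feedback, and trajectory logging.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str              # model's intermediate reasoning text for this turn
    tool: str                 # name of the visual tool invoked (e.g., "grounding")
    tool_input: dict          # arguments passed to the tool
    observation: str          # tool output fed back to the model


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None
    reward: float = 0.0       # verifiable feedback, e.g. exact match against ground truth


def run_episode(policy, question: str, image,
                tools: Dict[str, Callable],
                answer_checker: Callable[[str], float],
                max_turns: int = 8) -> Trajectory:
    """Roll out one multi-turn, tool-integrated reasoning episode and log it."""
    traj = Trajectory(question=question)
    context = [{"role": "user", "content": question, "image": image}]
    for _ in range(max_turns):
        # `policy.act` is a hypothetical wrapper around the VLM: it returns the
        # reasoning text plus either a tool call or a final answer.
        thought, tool_name, tool_input, final_answer = policy.act(context)
        if final_answer is not None:
            traj.answer = final_answer
            break
        observation = tools[tool_name](image=image, **tool_input)  # e.g., grounding, parsing
        traj.steps.append(Step(thought, tool_name, tool_input, str(observation)))
        context.append({"role": "tool", "name": tool_name, "content": str(observation)})
    traj.reward = answer_checker(traj.answer or "")  # verifiable reward signal for RL
    return traj
```

Trajectories collected this way would then feed the multi-turn sampling and end-to-end RL stage that the abstract mentions; the actual reward design, tool set, and logging format are defined by VISTA-Gym itself, not by this sketch.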
Related papers
- MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning [25.75780053067891]
Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis.
arXiv Detail & Related papers (2026-01-12T00:11:10Z) - Training Multi-Image Vision Agents via End2End Reinforcement Learning [51.81337984526068]
We propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs. We develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content.
arXiv Detail & Related papers (2025-12-05T10:02:38Z) - SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL [33.692408134748696]
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks.
arXiv Detail & Related papers (2025-12-03T18:50:04Z) - Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z) - Reinforced Visual Perception with Tools [66.79840157663237]
We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. We show that our method achieves state-of-the-art performance on several perception-heavy benchmarks. Our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. (A minimal sketch of the group-relative advantage used by GRPO appears after this related-papers list.)
arXiv Detail & Related papers (2025-09-01T17:57:49Z) - VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use [33.83255323522487]
We introduce VTool-R1, the first framework that trains vision-language models to generate multimodal chains of thought. VTool-R1 integrates Python-based visual editing tools into the Reinforcement Learning Finetuning process.
arXiv Detail & Related papers (2025-05-25T18:23:39Z) - OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. We propose a novel reinforcement learning framework, V-ToolRL, to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z)