Related papers: SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

URL: http://arxiv.org/abs/2512.04069v1
Date: Wed, 03 Dec 2025 18:50:04 GMT
Title: SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay,
Abstract summary: Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning.<n>We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools.<n>Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks.
Score: 33.692408134748696
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.

Related papers

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs [76.47326680870783]
We introduce VISTA-Gym, a training environment for incentivizing tool-integrated visual reasoning capabilities in vision-language models (VLMs)<n> VISTA-Gym unifies diverse real-world multimodal reasoning tasks with a standardized interface for visual tools.<n>We show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72% in 11 public reasoning-intensive VQA benchmarks.
arXiv Detail & Related papers (2025-11-24T22:58:26Z)
DeepEyesV2: Toward Agentic Multimodal Model [3.775371242454792]
Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning.<n>We introduce DeepEyesV2, and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation.<n>We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks.
arXiv Detail & Related papers (2025-11-07T14:31:20Z)
Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch [63.40752011615843]
Training tool-augmented language models has emerged as a promising approach to enhancing their capabilities for complex tasks.<n>We propose a dynamic generalization-guided reward design for rule-based reinforcement learning.<n>We show that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models.
arXiv Detail & Related papers (2025-11-02T16:33:45Z)
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use [78.29315418819074]
We introduce VerlTool, a unified and modular framework that addresses limitations through systematic design principles.<n>Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms.<n>The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions.
arXiv Detail & Related papers (2025-09-01T01:45:18Z)
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [63.31585771716123]
Large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL)<n>We introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning.<n>Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training.
arXiv Detail & Related papers (2025-05-22T09:00:19Z)
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs.<n>We propose a novel reinforcement learning framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools.<n>V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z)
ToolRL: Reward is All Tool Learning Needs [54.16305891389931]
Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities.<n>Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities.<n>We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
arXiv Detail & Related papers (2025-04-16T21:45:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.