Related papers: AdaTooler-V: Adaptive Tool-Use for Images and Videos

AdaTooler-V: Adaptive Tool-Use for Images and Videos

URL: http://arxiv.org/abs/2512.16918v2
Date: Fri, 19 Dec 2025 20:51:45 GMT
Title: AdaTooler-V: Adaptive Tool-Use for Images and Videos
Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue,
Abstract summary: AdaTooler-V is an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools.<n>AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro.
Score: 36.66944857910871
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8\% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

Related papers

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce textbfAdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior.<n>AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z)
AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning [79.65732142949014]
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories.<n>Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets.<n>We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories.
arXiv Detail & Related papers (2025-12-15T12:38:04Z)
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL [33.692408134748696]
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning.<n>We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools.<n>Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks.
arXiv Detail & Related papers (2025-12-03T18:50:04Z)
Reinforced Visual Perception with Tools [66.79840157663237]
We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools.<n>We show that our method achieves state-of-the-art performance on several perception-heavy benchmarks.<n>Our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench.
arXiv Detail & Related papers (2025-09-01T17:57:49Z)
ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools [9.788417605537965]
We introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge the gap in real-world tool-use proficiency.<n>ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions.<n>To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism.
arXiv Detail & Related papers (2025-08-05T10:06:16Z)
Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning [63.2198957755528]
We propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations.<n>Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories.<n>Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback.
arXiv Detail & Related papers (2025-06-05T04:35:49Z)
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs.<n>We propose a novel reinforcement learning framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools.<n>V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z)
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models [47.145844910856134]
Tool learning aims to augment large language models with diverse tools, enabling them to act as agents for solving practical tasks.<n>Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step.<n>Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios.<n>We propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from
arXiv Detail & Related papers (2025-03-03T17:37:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.