ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
- URL: http://arxiv.org/abs/2602.21265v1
- Date: Tue, 24 Feb 2026 09:23:12 GMT
- Title: ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
- Authors: Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee
- Abstract summary: ToolMATH turns math problems into a controlled, correctness-checkable benchmark with tool sets. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments, where producing the correct output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with accompanying tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we additionally provide a hard set, ToolMATH-Hard, with its own questions and tools. Our evaluation reveals that the key failure factor is the inability to reason: errors in intermediate results accumulate and constrain later decisions. Tool-list redundancy does not simply add noise; it amplifies small early deviations into irreversible execution drift. The benchmark also highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.
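The abstract describes a harness that checks whether a model's output "depends on calling schema-specified tools". A minimal sketch of that kind of correctness gate is shown below: a tool call is validated against a declared parameter schema before it is executed. The tool name, schema format, and call format here are illustrative assumptions, not taken from the ToolMATH benchmark itself.

```python
# Hypothetical sketch: validating a model's tool call against a declared
# schema, the kind of correctness gate a ToolMATH-style harness needs.
# The schema/call formats below are assumptions for illustration only.

def validate_call(call: dict, schema: dict) -> list[str]:
    """Return a list of schema violations (empty list means the call is valid)."""
    errors = []
    params = schema.get("parameters", {})
    required = {k for k, v in params.items() if v.get("required", False)}
    args = call.get("arguments", {})
    # Missing required arguments.
    for name in sorted(required - args.keys()):
        errors.append(f"missing required argument: {name}")
    # Unknown or wrongly typed arguments.
    for name, value in args.items():
        if name not in params:
            errors.append(f"unknown argument: {name}")
        elif not isinstance(value, params[name]["type"]):
            errors.append(f"bad type for {name}: expected {params[name]['type'].__name__}")
    return errors

# Toy catalog entry (an assumption, not ToolMATH's real schema format).
GCD_SCHEMA = {
    "name": "gcd",
    "parameters": {
        "a": {"type": int, "required": True},
        "b": {"type": int, "required": True},
    },
}

good = {"name": "gcd", "arguments": {"a": 12, "b": 18}}
bad = {"name": "gcd", "arguments": {"a": "12"}}
print(validate_call(good, GCD_SCHEMA))  # []
print(validate_call(bad, GCD_SCHEMA))   # missing 'b', bad type for 'a'
```

Rejecting malformed calls before execution is what makes a benchmark "correctness-checkable" at the call level, independent of whether the final answer is right.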
Related papers
- ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents [16.06309106596998]
ToolTok is a novel paradigm of multi-step pathfinding for GUI agents. We devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. We construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding.
arXiv Detail & Related papers (2026-01-30T08:38:05Z)
- AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z)
- From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models [18.072434766310458]
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. We show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning.
arXiv Detail & Related papers (2025-11-14T02:21:34Z)
- ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning [80.10274552177096]
Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. We propose a systematic approach to automatically aggregate an unstructured collection of tools into a structured tool library.
arXiv Detail & Related papers (2025-10-09T04:11:16Z)
- TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [74.47746287181383]
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool use capability.
arXiv Detail & Related papers (2025-10-06T07:30:25Z)
- MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use [72.53177559476704]
We introduce MCPVerse, a real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens. We benchmarked state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale).
arXiv Detail & Related papers (2025-08-22T09:47:53Z)
- Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models [8.573278807410507]
Tool learning can further broaden the usage scenarios of large language models (LLMs). We present a new tool-learning method, Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to finish tool calling in CoT reasoning.
arXiv Detail & Related papers (2025-03-21T01:26:12Z)
- Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
- Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain amount of tools through in-context learning or fine-tuning.
In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component.
We propose to enhance tool retrieval with iterative feedback from the large language model.
arXiv Detail & Related papers (2024-06-25T11:12:01Z)
- ControlLLM: Augment Language Models with Tools by Searching on Graphs [97.62758830255002]
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving real-world tasks.
Our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools.
arXiv Detail & Related papers (2023-10-26T21:57:21Z)