Related papers: From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

URL: http://arxiv.org/abs/2511.10899v1
Date: Fri, 14 Nov 2025 02:21:34 GMT
Title: From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Authors: Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
Abstract summary: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. We show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning.
Score: 18.072434766310458
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use: the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.
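As a toy illustration of what "helpful but not sufficient" can mean, the sketch below numerically checks the identity 1 + 3 + ... + (2n - 1) = n^2 for a finite range of n. The problem and the function names `sum_of_first_odds` and `check_identity` are hypothetical and are not taken from PYMATH or the authors' code.

```python
def sum_of_first_odds(n: int) -> int:
    """Return 1 + 3 + ... + (2n - 1) by direct summation."""
    return sum(2 * k - 1 for k in range(1, n + 1))


def check_identity(limit: int) -> bool:
    """Check that the sum of the first n odd numbers equals n**2 for 1 <= n <= limit."""
    return all(sum_of_first_odds(n) == n * n for n in range(1, limit + 1))


if __name__ == "__main__":
    # A passing check is evidence for the identity on a finite range only.
    print(check_identity(200))
```

A passing check supports the identity for the tested values, but it does not establish it for all n; a written argument (for example, induction on n) is still required. Treating the printed result as the whole solution would be an instance of the shortcut the abstract describes.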

Related papers

ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning [11.99927786717109]
ToolMATH turns math problems into a controlled, correctness-checkable benchmark with tool sets. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents.
arXiv Detail & Related papers (2026-02-24T09:23:12Z)
Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls [56.407063247662336]
We introduce Gecko, a comprehensive environment that simulates tool responses using a combination of rules and LLMs. GATS consistently improves the tool calling performance of various LLMs including GPT-4o, GPT-5, and Gemini-3.0-pro.
arXiv Detail & Related papers (2026-02-22T15:02:00Z)
ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset [43.45582911794623]
We introduce ToolMind, a high-quality tool-agentic dataset with 160k synthetic data instances. We employ fine-grained turn-level filtering to remove erroneous or suboptimal steps. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
arXiv Detail & Related papers (2025-11-12T13:01:23Z)
Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls. Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
Self-Training Large Language Models for Tool-Use Without Demonstrations [15.17750971071501]
Large language models (LLMs) remain prone to factual inaccuracies and computational errors. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. This paper investigates whether LLMs can learn to use tools without demonstrations.
arXiv Detail & Related papers (2025-02-09T12:06:10Z)
Meta-Reasoning Improves Tool Use in Large Language Models [10.193264105560864]
We present Tool selECTion via meta-reasONing (TECTON), a two-phase system that first reasons over a task and outputs candidate tools. TECTON results in substantial gains, both in-distribution and out-of-distribution, on a range of math reasoning datasets.
arXiv Detail & Related papers (2024-11-07T08:48:33Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools to address a range of tasks unattainable through language skills alone. We evaluate LLMs' tool-use performance under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask users questions whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
Multi-tool Integration Application for Math Reasoning Using Large Language Model [1.4582633500696451]
This article proposes a novel multi-tool application framework for mathematical reasoning. It aims to achieve more comprehensive and accurate mathematical reasoning by combining large language models (LLMs) with multiple external tools.
arXiv Detail & Related papers (2024-08-22T06:27:10Z)
What Are Tools Anyway? A Survey from the Language Model Perspective [67.18843218893416]
Language models (LMs) are powerful, yet mostly for text generation tasks. We provide a unified definition of tools as external programs used by LMs. We empirically study the efficiency of various tooling methods.
arXiv Detail & Related papers (2024-03-18T17:20:07Z)
Efficient Tool Use with Chain-of-Abstraction Reasoning [63.08202389132155]
Large language models (LLMs) need to ground their reasoning in real-world knowledge. There remain challenges in fine-tuning LLM agents to invoke tools in multi-step reasoning problems. We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.