Fugu-MT Paper Translation (Abstract): From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Paper overview: From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

arxiv url: http://arxiv.org/abs/2511.10899v1
Date: Fri, 14 Nov 2025 02:21:34 GMT
Status: Translation complete
Last updated in system: 2025-11-17 22:42:18.393585
Title: From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Title (reference translation): From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Authors: Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
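Abstract summary: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. Even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning.
Reference score (independently computed attention score): 18.072434766310458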
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.
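Abstract (reference translation): Tool-augmented language models (TaLMs) can call external tools to solve problems that lie beyond their parametric capacity. It is unclear, however, whether the gains obtained through these tools reflect trustworthy reasoning. Looking at the Code Interpreter tool, the authors show that even when tools are chosen and run correctly, TaLMs treat tool outputs as a replacement for reasoning and produce solutions that look correct but lack a coherent justification. They call this failure mode Tool-Induced Myopia (TIM) and study it with PYMATH, a benchmark of 1,679 competition-level mathematics problems for which Python code is helpful but not sufficient. They also develop a multi-dimensional evaluation suite that quantifies how much reasoning degrades in TaLMs relative to their non-tool counterparts. The results show that TaLMs gain up to 19.3 percentage points in final-answer accuracy, yet their reasoning behavior consistently deteriorates (for example, non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). The degradation grows with tool use: the more often a model invokes tools, the less coherent its reasoning becomes. Tool use also shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in about 55% of high-risk cases. Finally, the authors propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at https://github.com/megagonlabs/TIM.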
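
The abstract reports two different kinds of numbers: a gain in final-answer accuracy measured in percentage points, and a pairwise-comparison margin on the reasoning process. The sketch below shows one way such quantities could be computed. It is only an illustration: the function names, the toy data, and the definition of the win margin (difference in win rates over decided comparisons) are assumptions made here, not the paper's evaluation code or its exact metric definitions.

```python
# Illustrative sketch only. All names and data below are hypothetical and
# are not taken from the paper or its repository.

def accuracy_percent(correct_flags):
    """Share of correct final answers, in percent."""
    if not correct_flags:
        raise ValueError("need at least one problem")
    return 100.0 * sum(correct_flags) / len(correct_flags)


def percentage_point_gain(base_flags, tool_flags):
    """Absolute accuracy difference (tool minus base), in percentage points."""
    return accuracy_percent(tool_flags) - accuracy_percent(base_flags)


def win_rate_margin(judgments, a="base", b="tool"):
    """Win rate of `a` minus win rate of `b`, in percent.

    Assumed definition: ties are dropped and the margin is taken over
    decided comparisons only.
    """
    decided = [j for j in judgments if j in (a, b)]
    if not decided:
        raise ValueError("no decided comparisons")
    wins_a = sum(1 for j in decided if j == a)
    wins_b = len(decided) - wins_a
    return 100.0 * (wins_a - wins_b) / len(decided)


# Toy data: 1 means the final answer was correct, 0 means it was not.
base_flags = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # non-tool model: 3 of 10 correct
tool_flags = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]  # tool-augmented model: 5 of 10 correct

# Toy judge verdicts on which model's reasoning was better, per problem.
judgments = ["base", "base", "tool", "base", "tie", "base", "tool", "base"]

print("accuracy gain (percentage points):",
      percentage_point_gain(base_flags, tool_flags))
print("reasoning win margin for the non-tool model (%):",
      win_rate_margin(judgments))
```

In this toy setting the tool-augmented model is ahead on final-answer accuracy while the non-tool model is ahead on reasoning quality, which is the pattern the abstract describes.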

