Fugu-MT Paper Translation (Summary): Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Paper summary: Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

arxiv url: http://arxiv.org/abs/2508.15754v1
Date: Thu, 21 Aug 2025 17:50:24 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.431715
Title: Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
Authors: Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen
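Abstract summary: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods such as chain-of-thought (CoT) reasoning. Tool-Integrated Reasoning (TIR) has emerged as a solution that incorporates external tools into the reasoning process.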
Reference score (independently computed attention score): 45.74017777506391
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model's reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.
