Fugu-MT Paper Translation (Summary): Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Paper summary: Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

arxiv url: http://arxiv.org/abs/2605.14038v2
Date: Sun, 17 May 2026 15:23:37 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:45.904924
Title: Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Title (reference translation): Model-Adaptive Tool Necessity and the Knowing-Doing Gap in LLM Tool Use
Authors: Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi
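Abstract summary: Large language models (LLMs) increasingly act as autonomous agents that must decide when to invoke external tools and when to answer directly. This paper introduces a model-adaptive definition of tool necessity grounded in each model's empirical performance. Comparing that necessity with observed tool-call behavior reveals mismatches of 26.5-54.0% on arithmetic and 30.8-41.8% on factual QA.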
Reference score (independently computed attention score): 47.29360932085394
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
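Abstract (reference translation): Large language models (LLMs) increasingly function as autonomous agents that must decide when to invoke external tools and when to answer directly. Prior work on adaptive tool use has treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mainly covered cases where the answer is obvious (for example, fetching the weather versus paraphrasing text). However, because capability boundaries diverge across models, tool necessity in the wild is more nuanced. In this work, we introduce a model-adaptive definition of tool necessity grounded in each model's empirical performance. Following this definition, we compare necessity against observed tool-call behavior for four models on arithmetic and factual QA datasets and find mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether the model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool call. By probing LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions are nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we find that the majority of mismatches are concentrated in the transition from cognition to action. Improving tool-use reliability requires not only better recognition of when tools are needed but also better translation of that recognition into action.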
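
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact necessity rule, so `tool_necessary` assumes a hypothetical one (a tool counts as necessary when the model's accuracy without tools falls below a threshold). `mismatch_rate` and `cosine_similarity` are likewise illustrative helper names; the latter shows how near-orthogonality of two probe directions could be checked.

```python
import math


def tool_necessary(no_tool_correct, threshold=0.5):
    """Hypothetical necessity rule (an assumption, not the paper's definition):
    a tool is necessary when no-tool accuracy over sampled attempts is below threshold."""
    accuracy = sum(no_tool_correct) / len(no_tool_correct)
    return accuracy < threshold


def mismatch_rate(necessary, called):
    """Fraction of samples where the necessity label disagrees with tool-call behavior."""
    if len(necessary) != len(called):
        raise ValueError("label lists must have the same length")
    disagreements = sum(n != c for n, c in zip(necessary, called))
    return disagreements / len(necessary)


def cosine_similarity(u, v):
    """Cosine of the angle between two direction vectors; values near 0 mean near-orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data: per-sample necessity labels versus whether the model called a tool.
necessary = [True, False, True, False]
called = [True, True, False, False]
print(mismatch_rate(necessary, called))

# Toy probe directions for the "cognition" and "action" signals.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In this toy setup a mismatch covers both directions of disagreement: calling a tool when it is not needed, and skipping it when it is.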

Related papers