Fugu-MT Paper Translation (Abstract): How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Paper abstract: How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

arxiv url: http://arxiv.org/abs/2605.24660v1
Date: Sat, 23 May 2026 17:02:56 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.300615
Title: How Many Tools Should an LLM Agent See? A Chance-Corrected Answer
Title (reference translation): How Many Tools Should an LLM Agent See?
Authors: Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell,
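Abstract summary: A retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. The authors evaluate Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. They then turn the same principle into a reinforcement learning (RL) reward for choosing the tool shortlist depth per query.
Reference score (independently computed attention score): 1.5749416770494706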
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools ($90.3\%$ vs $90.8\%$) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage ($64.7\%$ vs $61.9\%$) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds $16.7\%$ on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: $93.1\%$ versus $87.1\%$ when always shown 5 tools, widening to $76.8\%$ vs $60.9\%$ on medium-difficulty queries where the correct tool is present but not ranked first.
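Abstract (reference translation): Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and apply Bits-over-Random (BoR), a chance-corrected metric. We evaluate BoR across tool-selection benchmarks, multiple scorers, and registries of 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, the random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool.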
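
The chance-correction idea in the abstract can be illustrated with a small sketch. The abstract does not state the BoR formula, so the definition used here is an assumption: the base-2 log of the observed hit rate at depth k divided by the hit rate that a uniformly random shortlist of the same depth would reach. The function names `random_hit_rate` and `bits_over_random` are hypothetical, and the input figures are borrowed from the abstract purely as illustrative inputs (the learned policy shows 7 tools on average, which is treated here as a fixed depth for simplicity).

```python
import math


def random_hit_rate(k: int, n_tools: int, n_relevant: int = 1) -> float:
    """Chance that k tools drawn uniformly without replacement from a
    registry of n_tools include at least one of n_relevant correct tools."""
    k = min(k, n_tools)
    # P(miss) = C(n_tools - n_relevant, k) / C(n_tools, k)
    miss = math.comb(n_tools - n_relevant, k) / math.comb(n_tools, k)
    return 1.0 - miss


def bits_over_random(observed_hit_rate: float, k: int, n_tools: int) -> float:
    """ASSUMED form of a chance-corrected score (not taken from the paper):
    log2(observed hit rate / random hit rate at the same depth)."""
    if observed_hit_rate <= 0.0:
        return float("-inf")  # no hits at all: unboundedly worse than chance
    return math.log2(observed_hit_rate / random_hit_rate(k, n_tools))


if __name__ == "__main__":
    # Illustrative inputs only: a 370-tool registry, coverage at two depths.
    for coverage, depth in [(0.903, 7), (0.908, 50)]:
        score = bits_over_random(coverage, depth, n_tools=370)
        print(f"depth={depth:>2}  coverage={coverage:.3f}  bits over random={score:.2f}")
```

Under this assumed form, a deeper shortlist raises the random baseline (k/N when exactly one tool is correct), so nearly equal coverage at depth 50 scores fewer bits than at depth 7. That matches the abstract's remark that the reward naturally decreases as the shortlist grows.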

Related papers