Fugu-MT Paper Translation (Abstract): PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Paper summary: PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

arxiv url: http://arxiv.org/abs/2606.22388v1
Date: Sun, 21 Jun 2026 08:29:12 GMT
Status: Translation complete
Last updated in system: 2026-06-25 18:34:20.078026
Title: PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
Title (reference translation): PlanBench-XL: Long-Horizon Planning Evaluation of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
Authors: Jiayu Liu, Qihan Lin, Cheng Qian, Rui Wang, Emre Can Acikgoz, Xiaocheng Yang, Jiateng Liu, Zhenhailong Wang, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür
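Abstract summary: PlanBench-XL is an interactive benchmark of 327 retail tasks across 1,665 tools. It tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal.
Reference score (independently computed attention score): 59.730861364166174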
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
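Abstract (reference translation): LLM agents increasingly operate in large-scale tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks across 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL also provides an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that planning over massive tool sets remains challenging: GPT-5.4 achieves 51.90% accuracy in block-free settings but collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.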
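
The retrieve-then-invoke loop and the "failing" blocking condition described in the abstract can be illustrated with a toy sketch. This is not the PlanBench-XL implementation: every name here (`Tool`, `ToolError`, `retrieve`, `run_episode`, `block_failing`, and the three retail tools) is hypothetical, and simple keyword overlap stands in for whatever retriever the benchmark actually uses. It only shows the shape of the problem: the agent sees just the top-k retrieved tools per step, passes evidence from one call to the next, and must fall back to an alternative tool when one is blocked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


class ToolError(Exception):
    """Explicit error signal raised by a tool blocked in 'failing' mode."""


@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[[Dict], Dict]  # reads the current state, returns new evidence


def retrieve(query: str, tools: List[Tool], k: int = 2) -> List[Tool]:
    """Retrieval-limited visibility: return only the k best keyword matches."""
    words = set(query.lower().split())
    scored = [(len(words & set(t.description.lower().split())), t) for t in tools]
    scored = [(s, t) for s, t in scored if s > 0]
    scored.sort(key=lambda st: -st[0])  # stable sort keeps registry order on ties
    return [t for _, t in scored[:k]]


def run_episode(subgoals: List[str], tools: List[Tool],
                state: Dict) -> Tuple[Optional[Dict], List[str]]:
    """Retrieve tools per sub-goal, invoke them, and adapt when a call fails."""
    state = dict(state)  # do not mutate the caller's state
    trace: List[str] = []
    for query in subgoals:
        for tool in retrieve(query, tools):
            try:
                state.update(tool.fn(state))  # evidence feeds later calls
                trace.append(tool.name)
                break
            except ToolError:
                trace.append(f"{tool.name}:failed")  # try the next candidate
        else:
            return None, trace  # no usable tool for this sub-goal
    return state, trace


def block_failing(tools: List[Tool], name: str) -> List[Tool]:
    """Hypothetical blocking: make the named tool raise on every call."""
    def broken(_state: Dict) -> Dict:
        raise ToolError(name)
    return [Tool(t.name, t.description, broken) if t.name == name else t
            for t in tools]


# Toy retail registry (invented for illustration only).
TOOLS = [
    Tool("find_order", "look up order id by customer email",
         lambda s: {"order_id": "A1"}),
    Tool("get_status", "get shipping status for order id",
         lambda s: {"status": "shipped", "source": f"store:{s['order_id']}"}),
    Tool("carrier_status", "fetch shipping status from carrier for order",
         lambda s: {"status": "shipped", "source": f"carrier:{s['order_id']}"}),
]
SUBGOALS = ["look up order by customer email", "get shipping status for order"]

if __name__ == "__main__":
    print(run_episode(SUBGOALS, TOOLS, {"email": "a@example.com"}))
    print(run_episode(SUBGOALS, block_failing(TOOLS, "get_status"),
                      {"email": "a@example.com"}))
```

In this sketch recovery works only because the blocked tool raises an explicit `ToolError`. A tool that silently returned wrong or empty evidence would pass straight through the loop, which mirrors the abstract's observation that agents are especially vulnerable when failures lack explicit error signals.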

Related papers