Fugu-MT Paper Translation (Overview): When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Paper overview: When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

arxiv url: http://arxiv.org/abs/2606.05806v1
Date: Thu, 04 Jun 2026 07:38:46 GMT
Status: Translation complete
System update date: 2026-06-05 22:39:44.626666
Title: When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Title (reference translation): When Tools Fail: A Benchmark for Dynamic Replanning and Anomaly Recovery in LLM Agents
Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
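Abstract summary: Existing benchmarks evaluate tool-integrated reasoning in LLMs on idealized "happy paths". We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents.
Reference score (independently computed attention score): 48.32450507410869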
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
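Abstract (reference translation): Existing benchmarks evaluate tool-integrated reasoning (TIR) in LLMs on idealized "happy paths". This paper introduces ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial and error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance in nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, the Perturbation Recovery Rate (PRR) falls by about 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault tolerance improves with model scale $3.66\times$ more slowly than basic task execution, which highlights dynamic replanning as a distinct bottleneck that neither model scaling nor prompting addresses. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.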

Related papers