Fugu-MT Paper Translation (Summary): Where LLM Agents Fail and How They can Learn From Failures

Paper summary: Where LLM Agents Fail and How They can Learn From Failures

arxiv url: http://arxiv.org/abs/2509.25370v1
Date: Mon, 29 Sep 2025 18:20:27 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.264117
Title: Where LLM Agents Fail and How They can Learn From Failures
Title (reference translation): Where LLM agents fail and how they can learn from failures
Authors: Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, Jiaxuan You
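Abstract summary: Large language model (LLM) agents show promise in solving complex, multi-step tasks. However, their sophisticated architectures amplify vulnerability to cascading failures, in which a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way. AgentErrorTaxonomy is a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
Reference score (independently computed attention measure): 62.196870049524364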
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
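Abstract (reference translation): LLM (Large Language Model) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. However, their sophisticated architectures amplify vulnerability to cascading failures, in which a single root-cause error propagates through subsequent decisions and leads to task failure. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way, and so they fail to detect these errors. The paper addresses this gap with three contributions. First, it introduces AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, it constructs AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, it proposes AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback. In experiments on AgentErrorBench, AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy than the strongest baseline. The targeted feedback generated by AgentDebug enables LLM agents to recover from failures iteratively, yielding up to 26% relative improvement in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug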
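
The idea of a modular error taxonomy with root-cause isolation can be illustrated with a small sketch. This is not the authors' implementation: the five module names come from the abstract, but the `StepAnnotation` type, the `find_root_cause` helper, and the rule "the root cause is the earliest annotated error in the trajectory" are assumptions made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Module(Enum):
    """The five module categories named in the abstract."""
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"


@dataclass(frozen=True)
class StepAnnotation:
    """Hypothetical per-step label: the faulty module, or None if the step is fine."""
    step: int
    faulty_module: Optional[Module] = None


def find_root_cause(trajectory: Sequence[StepAnnotation]) -> Optional[StepAnnotation]:
    """Return the earliest annotated error, or None if no step is faulty.

    Assumption: later errors are treated as downstream effects of the
    first one, which is a simplification of cascading failure.
    """
    errors = [s for s in trajectory if s.faulty_module is not None]
    return min(errors, key=lambda s: s.step, default=None)


# Hypothetical trajectory: a planning error at step 2 cascades into an action error.
trajectory = [
    StepAnnotation(step=1),
    StepAnnotation(step=2, faulty_module=Module.PLANNING),
    StepAnnotation(step=3, faulty_module=Module.ACTION),
    StepAnnotation(step=4),
]

root = find_root_cause(trajectory)
if root is not None:
    print(f"root cause: step {root.step}, module {root.faulty_module.value}")
```

In this sketch, feedback would target the planning step rather than the later action error, which mirrors the abstract's point that a single root-cause error propagates through subsequent decisions.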

Related papers