Fugu-MT Paper Translation (Summary): AI Planning Framework for LLM-Based Web Agents

Paper summary: AI Planning Framework for LLM-Based Web Agents

arxiv url: http://arxiv.org/abs/2603.12710v1
Date: Fri, 13 Mar 2026 06:46:32 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.949748
Title: AI Planning Framework for LLM-Based Web Agents
Authors: Orit Shahnovsky, Rotem Dror
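Abstract summary: We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms. We propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. Our results show that the Step-by-Step agent aligns more closely with human gold trajectories, while the Full-Plan-in-Advance agent excels in technical measures such as element accuracy.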
Reference score (independently computed attention score): 2.9376953730570197
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.
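For readers less familiar with the three classical search strategies named in the abstract, the following is a minimal Python sketch of how BFS, DFS, and best-first search differ only in how the frontier is ordered. It illustrates the textbook algorithms, not the paper's agents or its taxonomy. The toy graph, the heuristic values, and the function name `search` are all hypothetical and chosen purely for illustration.

```python
import heapq
from collections import deque

# Hypothetical toy state graph; nodes stand in for abstract task states.
GRAPH = {
    "start": ["a", "b"],
    "a": ["c", "d"],
    "b": ["goal"],
    "c": [],
    "d": [],
    "goal": [],
}

# Hypothetical heuristic: a lower value means "looks closer to the goal".
HEURISTIC = {"start": 2, "a": 2, "b": 1, "c": 3, "d": 3, "goal": 0}


def search(graph, start, goal, strategy, heuristic=None):
    """Return the order in which states are expanded until the goal is reached.

    strategy is one of "bfs", "dfs", or "best_first".
    """
    if strategy == "best_first":
        frontier = [(heuristic[start], start)]  # priority queue of (score, state)
    else:
        frontier = deque([start])
    expanded = set()
    order = []
    while frontier:
        if strategy == "bfs":
            node = frontier.popleft()          # FIFO: oldest state first
        elif strategy == "dfs":
            node = frontier.pop()              # LIFO: newest state first
        else:
            _, node = heapq.heappop(frontier)  # lowest heuristic value first
        if node in expanded:
            continue
        expanded.add(node)
        order.append(node)
        if node == goal:
            return order
        children = graph[node]
        if strategy == "dfs":
            # Push in reverse so the first-listed child is expanded first.
            children = list(reversed(children))
        for child in children:
            if child not in expanded:
                if strategy == "best_first":
                    heapq.heappush(frontier, (heuristic[child], child))
                else:
                    frontier.append(child)
    return order


for name in ("bfs", "dfs", "best_first"):
    print(name, search(GRAPH, "start", "goal", name, HEURISTIC))
```

On this toy graph, BFS expands every state at one depth before moving deeper, DFS follows one branch to its end before backtracking, and best-first search follows the heuristic directly toward the goal.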
