Fugu-MT Paper Translation (Abstract): DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Paper overview: DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

arxiv url: http://arxiv.org/abs/2605.23939v1
Date: Tue, 28 Apr 2026 11:39:20 GMT
Status: Translation complete
Last updated in system: 2026-06-01 02:55:42.95692
Title: DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning
Title (reference translation): DRIVE: Modeling Skills at the Reasoning and Interaction Levels of Web Agents in Continual Learning
Authors: Xirui Liu, Sihang Zhou, Yanning Hou, Rong Zhou, Haoyuan Chen, Maolin He, Siwei Wang, Hao Chen, Jian Huang
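Abstract summary: Web agents need both high-level reasoning and low-level interaction to carry out different tasks. This paper proposes DRIVE, a dual-level skill modeling framework that separates historical experience into natural-language reasoning skills and programmatic interaction skills. In experiments, DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points.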
Reference score (independently computed attention score): 17.92660876001036
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.
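Abstract (reference translation): Web agents need both high-level reasoning (task decomposition) and low-level interaction (page-element manipulation) to carry out different tasks. However, these two kinds of knowledge differ fundamentally: reasoning knowledge (for example, booking a flight requires first searching for routes) is abstract and transfers across websites, whereas interaction knowledge (for example, clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific context. Existing methods store experiences uniformly, which creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. On new websites, agents either fail to recognize reusable task logic because of surface-level differences or attempt infeasible actions based on outdated page structures. DRIVE separates historical experience into natural-language reasoning skills, which capture transferable task logic, and programmatic interaction skills, which ground abstract actions in executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted expansion and refinement of the skill library. In experiments across five WebArena domains, DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Ablations further show that reasoning and interaction skills provide distinct, complementary benefits, supporting the separation of transferable task logic from executable page-level operations.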
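
Illustrative sketch (not from the paper): the abstract describes two skill levels that are stored separately and retrieved together. The Python below is a minimal toy of that idea, assuming a keyword-overlap match for reasoning skills and an exact site match for interaction skills. Every name in it (ReasoningSkill, InteractionSkill, SkillLibrary, retrieve, search_on_site_a, the action strings) is hypothetical and is not the authors' implementation; the scene-aware coordination mechanism and skill-level reflection are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ReasoningSkill:
    """Natural-language task logic, intended to transfer across websites."""
    name: str
    steps: List[str]


@dataclass
class InteractionSkill:
    """Executable page-level routine, tied to one specific site."""
    name: str
    site: str
    run: Callable[[Dict[str, str]], List[str]]


@dataclass
class SkillLibrary:
    reasoning: List[ReasoningSkill] = field(default_factory=list)
    interaction: List[InteractionSkill] = field(default_factory=list)

    def retrieve(
        self, task: str, site: str
    ) -> Tuple[List[ReasoningSkill], List[InteractionSkill]]:
        # Assumption: keyword overlap stands in for semantic retrieval.
        words = set(task.lower().split())
        plans = [s for s in self.reasoning if words & set(s.name.lower().split())]
        # Assumption: interaction skills are reused only on the site they came from.
        actions = [s for s in self.interaction if s.site == site]
        return plans, actions


def search_on_site_a(args: Dict[str, str]) -> List[str]:
    # Hypothetical action strings; a real agent would emit browser commands.
    return [f"type('#query', '{args['query']}')", "click('#search')"]


library = SkillLibrary()
library.reasoning.append(
    ReasoningSkill(
        "book flight",
        ["search for routes", "select a fare", "enter passenger details"],
    )
)
library.interaction.append(InteractionSkill("search", "site_a", search_on_site_a))

# On an unseen site, the reasoning skill still matches; the site_a routine does not.
plans, actions = library.retrieve("Book a flight to Tokyo", site="site_b")
```

In this toy, a task on the unseen site_b still retrieves the "book flight" plan but none of the site_a routines, which mirrors the abstract's claim that task logic transfers across websites while page-level operations do not.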

Related papers