Fugu-MT Paper Translation (Abstract): GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Paper overview: GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

arxiv url: http://arxiv.org/abs/2603.26266v1
Date: Fri, 27 Mar 2026 10:33:08 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.453583
Title: GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
Authors: Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li,
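Abstract summary: GUIDE is a training-free, plug-and-play framework that resolves domain bias in GUI agents. It autonomously acquires domain-specific expertise from web tutorial videos. It consistently yields improvements of over 5% without modifying any model parameters or architecture.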
Reference score (independently computed attention score): 27.67914789321192
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge, which is injected into the agent's corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE's generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.
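The progressive three-stage retrieval named in the abstract (domain classification, topic extraction, relevance matching) can be illustrated with a toy sketch over subtitle text. This is not the authors' implementation: the abstract does not say which models or scoring functions GUIDE uses, so plain keyword overlap and Jaccard similarity stand in for them here. The `Video` type, `DOMAIN_KEYWORDS`, `STOPWORDS`, and all function names are hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    subtitles: str


# Hypothetical domain vocabularies (assumption, not from the paper).
DOMAIN_KEYWORDS = {
    "spreadsheet": {"cell", "column", "formula", "sheet", "row"},
    "image_editor": {"layer", "brush", "canvas", "crop", "filter"},
}
STOPWORDS = {"a", "an", "the", "and", "for", "in", "on", "of", "to"}


def tokens(text):
    """Lowercase alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_domain(text):
    """Stage 1: pick the domain whose keywords overlap the text most."""
    words = tokens(text)
    scores = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_topics(text):
    """Stage 2: crude topic extraction, here just non-stopword tokens."""
    return tokens(text) - STOPWORDS


def relevance(task_topics, video_topics):
    """Stage 3: Jaccard similarity between the two topic sets."""
    union = task_topics | video_topics
    return len(task_topics & video_topics) / len(union) if union else 0.0


def retrieve(task, videos, top_k=1):
    """Run the three stages in order, narrowing the candidates each time."""
    domain = classify_domain(task)
    if domain is None:
        return []
    candidates = [v for v in videos if classify_domain(v.subtitles) == domain]
    task_topics = extract_topics(task)
    scored = [(relevance(task_topics, extract_topics(v.subtitles)), v)
              for v in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for score, v in scored[:top_k] if score > 0]
```

Calling `retrieve(task, videos)` with a task string and a list of `Video` objects returns at most `top_k` videos from the matching domain, ranked by topic overlap. Filtering by domain first keeps the later, finer-grained comparison restricted to plausible candidates, which is the point of a progressive design.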

Related papers