Fugu-MT Paper Translation (Abstract): VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Paper summary: VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

arxiv url: http://arxiv.org/abs/2604.21375v1
Date: Thu, 23 Apr 2026 07:42:37 GMT
Status: translation complete
System update date: 2026-04-24 14:40:06.368233
Title: VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
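Title (reference translation): VLAA-GUI - A Modular Framework for GUI Automation
Authors: Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie
Abstract summary: VLAA-GUI is a modular GUI framework built around three integrated components. A mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step. A mandatory Loop Breaker provides multi-tier filtering, including switching interaction mode after repeated failures.
Reference score (independently computed attention score): 98.38575149237442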
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
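Abstract (reference translation): Autonomous GUI agents face two fundamental challenges: early stopping, in which an agent prematurely declares success without verifiable evidence, and repetitive loops, in which an agent cycles through the same failing actions without recovering. VLAA-GUI is a modular GUI agent framework built around three integrated components that guide the system on when to stop, recover, and search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- an agent-level verifier cross-examines completion claims against decision rules and rejects those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing a strategy change after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent looks up unfamiliar workflows online by directly querying an LLM with search ability and returns the results as plain text. In addition, a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding are integrated, both invoked on demand. VLAA-GUI is evaluated with five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, and achieves top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis shows that the Loop Breaker nearly halves wasted steps for loop-prone models.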
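
Illustrative sketch (not from the paper): the abstract describes the Loop Breaker only at a high level, so the following Python is a minimal guess at how its three tiers could be wired together, not the authors' implementation. The class name, method name, thresholds, window size, and returned labels are all assumptions made for illustration.

```python
from collections import deque


class LoopBreaker:
    """Hypothetical multi-tier loop detector; every name and threshold is assumed."""

    def __init__(self, max_failed_repeats=3, max_state_repeats=3, window=10):
        self.max_failed_repeats = max_failed_repeats  # assumed tier-1 threshold
        self.max_state_repeats = max_state_repeats    # assumed tier-2 threshold
        self.failed_actions = deque(maxlen=window)    # recent failing actions
        self.screen_states = deque(maxlen=window)     # recent screen-state hashes

    def observe(self, action, succeeded, screen_hash, reflection_flags_loop=False):
        """Record one agent step and return the suggested intervention."""
        if not succeeded:
            self.failed_actions.append(action)
        self.screen_states.append(screen_hash)

        # Reflection signal bound to a strategy shift.
        if reflection_flags_loop:
            return "change_strategy"
        # Persistent screen-state recurrence forces a strategy change.
        if self.screen_states.count(screen_hash) >= self.max_state_repeats:
            return "change_strategy"
        # Repeated failures of the same action switch the interaction mode.
        if not succeeded and self.failed_actions.count(action) >= self.max_failed_repeats:
            return "switch_interaction_mode"
        return "continue"


if __name__ == "__main__":
    breaker = LoopBreaker()
    for step, screen in enumerate(["s1", "s2", "s3"]):
        print(step, breaker.observe("click(10, 20)", False, screen))
```

In this sketch the caller supplies a hash of the current screenshot and a flag from its own reflection step; how VLAA-GUI actually detects screen-state recurrence or produces reflection signals is not stated in the abstract.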
