Fugu-MT Paper Translation (Abstract): AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Paper abstract: AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

arxiv url: http://arxiv.org/abs/2604.24441v1
Date: Mon, 27 Apr 2026 13:06:27 GMT
Status: Translation complete
System update date: 2026-04-28 17:12:08.013175
Title: AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
Authors: Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
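Abstract summary: AutoGUI-v2 is a benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. The authors construct it with a novel VLM-human collaborative pipeline that parses screenshots into hierarchical functional regions. AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction.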
Reference score (independently computed attention score): 32.66632642377623
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
