Fugu-MT Paper Translation (Abstract): FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

Paper abstract: FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

arxiv url: http://arxiv.org/abs/2508.09241v1
Date: Tue, 12 Aug 2025 15:12:42 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.648233
Title: FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents
Title (reference translation): FineState-Bench: A comprehensive benchmark for detailed state control of GUI agents
Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen
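Abstract summary: FineState-Bench is an evaluation and diagnostic standard for GUI proxy operations. It includes 2257 task benchmarks in four components and uses a four-phase indicator for perception-to-control assessment. The diagnostic framework confirms for the first time that the primary bottleneck for current GUI proxies is basic visual positioning capability.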
Reference score (independently computed attention score): 12.315613848863784
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI proxy operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments to quantify the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI proxies is basic visual positioning capability. All resources are fully open-source. github: https://github.com/AnonymousThewarehouse/FineState-Bench huggingface: https://huggingface.co/datasets/Willtime2006/Static-FineBench
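Abstract (reference translation): With the rapid progress of generative artificial intelligence technology, graphical user interface (GUI) agents have demonstrated great potential for autonomously managing daily tasks through natural language. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks focus too heavily on coarse-grained task completion while neglecting the fine-grained control capabilities that are essential for real-world applications. To address this, we introduce FineState-Bench. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. We also developed the Visual Diagnostic Assistant (VDA), which enables the first quantitative decoupling analysis of these capabilities. Experimental results show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using VDA in controlled experiments to quantify the impact of visual capabilities, we show that ideal visual localization raises Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI proxies is basic visual positioning capability. All resources are fully open-source. github: https://github.com/AnonymousThewarehouse/FineState-Bench huggingface: https://huggingface.co/datasets/Willtime2006/Static-FineBench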

Related papers