FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents
- URL: http://arxiv.org/abs/2508.09241v1
- Date: Tue, 12 Aug 2025 15:12:42 GMT
- Title: FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents
- Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen
- Abstract summary: We introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations. FineState-Bench includes 2257 task benchmarks across four components and uses a four-phase indicator for perception-to-control assessment. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability.
- Score: 12.315613848863784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting the fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, designed to quantify fine-grained control. This multi-platform (desktop, web, mobile) framework includes 2257 task benchmarks across four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using the VDA in controlled experiments that quantify the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability. All resources are fully open-source. GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench
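The decoupling experiment described above can be pictured with a small harness: run the same agent twice, once with its own grounding and once with oracle (ground-truth) localization, and read the gap as the cost of imperfect perception. The sketch below is illustrative only; `Task`, `locate`, and `act` are hypothetical names, not the benchmark's actual API.

```python
# Hypothetical sketch of the decoupling idea behind the VDA: compare an
# agent's success rate with its own grounding against its success rate with
# ground-truth coordinates, isolating the visual-localization bottleneck.
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class Task:
    instruction: str
    target_bbox: Tuple[int, int, int, int]  # ground-truth widget location

def success_rate(tasks: Sequence[Task],
                 locate: Callable[[Task], Tuple[int, int]],
                 act: Callable[[Task, Tuple[int, int]], bool]) -> float:
    """Fraction of tasks completed when `locate` supplies the click point."""
    return sum(act(t, locate(t)) for t in tasks) / len(tasks)

def oracle_locate(task: Task) -> Tuple[int, int]:
    # Ideal localization: click the center of the ground-truth bounding box.
    x0, y0, x1, y1 = task.target_bbox
    return (x0 + x1) // 2, (y0 + y1) // 2

def grounding_gap(tasks, model_locate, act) -> float:
    """Success-rate gain from perfect grounding; a large gap suggests
    perception, not control, is the bottleneck."""
    return (success_rate(tasks, oracle_locate, act)
            - success_rate(tasks, model_locate, act))
```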
Related papers
- GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
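For illustration, aggregating the five GE-Score dimensions might look like the sketch below; the abstract names the dimensions but not the aggregation rule, so the unweighted mean here is an assumption.

```python
# Illustrative only: combine per-dimension ratings into one GE-Score.
# The unweighted mean is an assumed aggregation, not the paper's definition.
from statistics import mean

GE_DIMENSIONS = ("goal_achievement", "interaction_logic",
                 "content_consistency", "ui_plausibility", "visual_quality")

def ge_score(ratings: dict[str, float]) -> float:
    """Aggregate per-dimension ratings (e.g., on a 1-5 scale)."""
    missing = set(GE_DIMENSIONS) - ratings.keys()
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(ratings[d] for d in GE_DIMENSIONS)

# Example:
# ge_score({"goal_achievement": 4, "interaction_logic": 3,
#           "content_consistency": 5, "ui_plausibility": 4,
#           "visual_quality": 4})
```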
arXiv Detail & Related papers (2026-02-09T18:52:02Z)
- GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents [39.807839972627015]
We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. We introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples.
arXiv Detail & Related papers (2026-01-14T14:27:28Z)
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands [59.222064425122795]
We develop ShowUI-π, the first flow-based generative model to act as a GUI dexterous hand. ShowUI-π achieves a score of 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach.
arXiv Detail & Related papers (2025-12-31T16:51:14Z)
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding [71.97466930670936]
Grounding is a fundamental capability for building graphical user interface (GUI) agents. In this paper, we investigate zooming as a strong yet underexplored prior for GUI grounding and propose a training-free method, ZoomClick. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models.
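A training-free zoom prior of this kind can be sketched as a two-pass loop: ground on the full screenshot, crop around the coarse prediction, re-ground on the enlarged crop, and map the refined point back. This is an assumed reading of the mechanism, not ZoomClick's published algorithm.

```python
# Minimal training-free zoom loop (assumed mechanics): coarse grounding on
# the full image, fine grounding on an enlarged crop, coordinates mapped back.
from PIL import Image

def zoom_ground(img: Image.Image, query: str, ground, crop_frac: float = 0.25):
    """`ground(image, query) -> (x, y)` is any grounding model callable."""
    W, H = img.size
    x, y = ground(img, query)                      # coarse pass
    w, h = int(W * crop_frac), int(H * crop_frac)
    left = min(max(x - w // 2, 0), W - w)          # clamp crop to the image
    top = min(max(y - h // 2, 0), H - h)
    crop = img.crop((left, top, left + w, top + h)).resize((W, H))
    cx, cy = ground(crop, query)                   # fine pass on zoomed view
    return left + cx * w // W, top + cy * h // H   # map back to full image
```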
arXiv Detail & Related papers (2025-12-05T18:39:12Z)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z)
- DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards [44.69783955774917]
DashboardQA is a benchmark designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. It includes 112 interactive dashboards from Tableau Public and 405 question-answer pairs spanning five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. Our findings indicate that interactive dashboard reasoning is a challenging task for all the VLMs evaluated.
arXiv Detail & Related papers (2025-08-24T15:11:44Z)
- GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents [13.415165482033395]
Out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of GUI agents may cause task breakdowns or pose security threats. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. We propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent, which reflect its capability boundary.
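As a rough sketch of that idea (simplified, not the authors' implementation): fit a Gaussian mixture to distances between in-distribution instruction embeddings and their centroid, then flag inputs whose distance has low likelihood under the mixture.

```python
# Simplified sketch of GMM-over-embedding-distances OOD detection.
# The centroid distance and 1% likelihood threshold are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ood_detector(train_emb: np.ndarray, n_components: int = 3):
    """Fit a GMM to distances from in-distribution embeddings to their centroid."""
    centroid = train_emb.mean(axis=0)
    dists = np.linalg.norm(train_emb - centroid, axis=1).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components).fit(dists)
    threshold = np.quantile(gmm.score_samples(dists), 0.01)  # 1% tail cutoff
    return centroid, gmm, threshold

def is_ood(emb: np.ndarray, centroid, gmm, threshold) -> bool:
    """Flag an input whose distance is unlikely under the fitted mixture."""
    d = np.array([[np.linalg.norm(emb - centroid)]])
    return gmm.score_samples(d)[0] < threshold  # low log-likelihood -> OOD
```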
arXiv Detail & Related papers (2025-05-19T08:29:05Z)
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for vision-language model agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of over 28% on the ScreenSpot-Pro and over 24% on the WebVoyager benchmarks.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point [17.165899818213475]
We introduce WorldGUI, a comprehensive GUI benchmark containing tasks across ten widely used desktop and web applications. WorldGUI-Agent is a universal framework that unifies three core modules: Planner-Critic for high-level plan refinement, Step-Check for intermediate verification, and Actor-Critic for action-level optimization.
arXiv Detail & Related papers (2025-02-12T01:06:10Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)