Summary
This week's work reflects a shift from building GUI-capable VLM/LLM agents toward evaluating them more rigorously across platforms, capability levels, and failure modes. Representative papers argue that current benchmarks remain too narrow, often limited to task success rate on a single platform, while grounding accuracy, operational efficiency, hallucination diagnosis, and privacy-aware execution are insufficiently measured.
Situation
As vision-language and multimodal language models improve, GUI agents are increasingly used to automate complex interactions across software environments. However, the representative papers emphasize that evaluation has lagged behind this progress: existing benchmarks often test isolated skills rather than the full chain from interface understanding to execution, focus mainly on success rate instead of operational efficiency, and provide limited coverage of real-world platforms such as desktop systems and cross-application settings.
The introductions also highlight two especially important gaps. First, precise GUI grounding remains a core bottleneck, particularly in desktop environments like Windows, where structured interface metadata is often unavailable and screenshots become the agent's primary source of information about the interface. Second, realistic deployment raises privacy concerns because GUI agents frequently process screenshots containing sensitive personal information, yet the field still lacks mature benchmarks and frameworks for measuring privacy recognition, protection, and task execution under privacy constraints.
Progress
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI is the seed paper anchoring this week's theme and serves as the baseline reference for interpreting the progress in the other papers listed here.
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder broadens GUI evaluation by benchmarking whether LLM-generated applications are end-to-end playable, not just test-case correct. Compared with prior assessments focused on code-level accuracy, it introduces a 43-app multilingual benchmark and the Play@k metric targeting executable playability.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
This paper addresses a core GUI-agent bottleneck by replacing static consistency heuristics with a learned visual critic for grounding language instructions to pixel coordinates. The co-evolved proposer–critic approach lets the model select among on-screen visual proposals, improving localization accuracy over fixed selection strategies.
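The selection step described above can be illustrated with a minimal sketch. It assumes only what the summary states: several candidate click points are proposed, and a critic chooses among them. The names `Proposal`, `select_with_critic`, and `toy_critic` are hypothetical, and the distance-based scorer is a stand-in for a learned visual critic, not the paper's model or training procedure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Proposal:
    x: int  # pixel column of the candidate click point
    y: int  # pixel row of the candidate click point


def select_with_critic(
    proposals: Sequence[Proposal],
    critic: Callable[[Proposal], float],
) -> Proposal:
    """Return the proposal the critic scores highest (ties keep the earliest)."""
    if not proposals:
        raise ValueError("need at least one proposal")
    return max(proposals, key=critic)


def toy_critic(target: Proposal) -> Callable[[Proposal], float]:
    """Hypothetical stand-in for a learned critic: closer to a known target scores higher."""
    def score(p: Proposal) -> float:
        return -float((p.x - target.x) ** 2 + (p.y - target.y) ** 2)
    return score


# Three candidate click points for one instruction (made-up coordinates).
candidates = [Proposal(100, 40), Proposal(212, 318), Proposal(205, 300)]
best = select_with_critic(candidates, toy_critic(Proposal(210, 315)))
print(best)
```

The point of the sketch is the interface: the selector does not care how the critic computes its score, so a fixed heuristic and a learned model can be swapped without changing the selection code.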
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear provides a dedicated suite for diagnosing, evaluating, and mitigating hallucinations specific to GUI agents. Compared with earlier broad task-success metrics, it introduces a GUI-specific hallucination taxonomy and a three-stage evaluation workflow that improves VLM-as-judge reliability.
Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
This paper formalizes temporal UI state inconsistency, which time-of-check to time-of-use (TOCTOU) attacks exploit, as a distinct failure mode for desktop GUI agents. Compared with prior screenshot-then-click setups that assume static screens, it demonstrates the need for state revalidation before action dispatch and proposes a lightweight three-layer defense.
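State revalidation before action dispatch can be sketched generically as follows. This is not the paper's three-layer defense, whose details are not given here. It is a minimal illustration that assumes the agent can re-capture the pixels of the region it intends to click. `PlannedClick`, `dispatch_if_unchanged`, and the fake `screen` dictionary are hypothetical names used only for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Region = Tuple[int, int, int, int]  # left, top, right, bottom


@dataclass(frozen=True)
class PlannedClick:
    x: int
    y: int
    region: Region     # area around the target that must stay unchanged
    fingerprint: str   # hash of the region's pixels at check time


def fingerprint(pixels: bytes) -> str:
    """Hash raw pixel bytes so two captures can be compared cheaply."""
    return hashlib.sha256(pixels).hexdigest()


def dispatch_if_unchanged(
    plan: PlannedClick,
    capture_region: Callable[[Region], bytes],
    click: Callable[[int, int], None],
) -> bool:
    """Re-capture the target region and click only if it still matches the plan."""
    if fingerprint(capture_region(plan.region)) != plan.fingerprint:
        return False  # UI changed between check and use; caller should re-observe
    click(plan.x, plan.y)
    return True


# Fake screen and click recorder standing in for a real desktop (assumption).
screen: Dict[str, bytes] = {"pixels": b"OK-button"}
clicks: List[Tuple[int, int]] = []
capture = lambda region: screen["pixels"]
record = lambda x, y: clicks.append((x, y))

region = (40, 10, 60, 30)
plan = PlannedClick(50, 20, region, fingerprint(capture(region)))
first = dispatch_if_unchanged(plan, capture, record)    # screen unchanged
screen["pixels"] = b"Delete-button"                     # UI swapped under the agent
second = dispatch_if_unchanged(plan, capture, record)   # stale plan is rejected
```

A check like this narrows the window between observation and action but does not close it, since the screen can still change between the re-capture and the click. Exact pixel hashing is also brittle under animations or cursor blinks, so a real system would likely need a more tolerant comparison.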
Outlook
Outlook Summary
GUI-agent evaluation is likely to move from single-score leaderboards toward broader diagnostic benchmarks. New tests will cover desktop, mobile, web, and cross-app workflows, and will judge not only task completion but also grounding accuracy, efficiency, playability, hallucinated UI elements, and consistency as interface state changes. This reflects work on multi-platform GUI benchmarks, desktop grounding, executable playability, and hallucination analysis. A second direction is closer alignment between evaluation and known bottlenecks: precise localization and privacy-aware execution. Future benchmarks are likely to use trajectory logs, dynamic UI states, and privacy detectors to test whether agents can act reliably when screenshots are sanitized or constrained.
Three-Year Movement
Over the next year, GUI-agent evaluation is likely to become more diagnostic, with benchmarks that inspect each action instead of only the final result. These tests will use trajectory logs, meaning records of the agent’s steps, to find bad clicks, redundant actions, hallucinated interface elements, state-tracking failures, and privacy mistakes. By 36 months, this movement points to interactive online benchmarks across desktop, mobile, web, and cross-app workflows, where interfaces can change during the task. The main research challenges will be reliable localization, efficient planning, stable state tracking, recovery after mistakes, and privacy-preserving completion. In deployment, richer evaluation will become part of the approval pipeline for bounded enterprise and consumer workflows. Agents will be trusted only where they can show accurate clicking, few unnecessary actions, recovery from UI changes, and differentiated handling of sensitive screen content such as names, emails, credentials, or financial information.
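As a rough illustration of step-level diagnosis from a trajectory log, the sketch below counts three of the failure types named above. The `Step` schema and the rule that an immediately repeated action counts as redundant are assumptions made for the example, not a format or definition taken from any cited benchmark.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Step:
    action: str            # e.g. "click" or "type"
    target: Optional[str]  # element id the agent claimed to act on
    on_screen: bool        # whether that element existed in the recorded UI state
    hit: bool              # whether the action landed inside the element's bounds


def diagnose(steps: Sequence[Step]) -> Counter:
    """Count per-step failure types instead of scoring only the final outcome."""
    counts: Counter = Counter()
    previous: Optional[Step] = None
    for step in steps:
        if not step.on_screen:
            counts["hallucinated_element"] += 1   # acted on something not in the UI
        elif not step.hit:
            counts["missed_click"] += 1           # real element, wrong location
        if previous is not None and (step.action, step.target) == (
            previous.action,
            previous.target,
        ):
            counts["redundant_action"] += 1       # same action repeated back to back
        previous = step
    return counts


# A made-up four-step trajectory.
trajectory = [
    Step("click", "menu", on_screen=True, hit=True),
    Step("click", "menu", on_screen=True, hit=True),
    Step("click", "save", on_screen=True, hit=False),
    Step("click", "export_pdf", on_screen=False, hit=False),
]
report = diagnose(trajectory)
print(dict(report))
```

Two agents with the same final success rate can produce very different reports under this kind of analysis, which is the argument for logging each step.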
In the near term, broader GUI-agent benchmarks will expose a repeatability problem: the same task can change because of account state, app versions, notifications, privacy masks, or earlier agent actions. Research is therefore likely to build more instrumented benchmark harnesses, where the test environment includes reset tools, logs, UI-state snapshots, version records, and replayable failure traces. By 36 months, leading benchmarks may look less like static leaderboards and more like continuous evaluation services. They would maintain versioned apps, resettable accounts, privacy-differentiated task states, controlled UI changes, and long trace analysis. Agents would be judged on whether they keep grounding, efficiency, privacy compliance, recovery, and temporal consistency across repeatable but evolving GUI states. In practical use, regulated or high-assurance automation would depend on CI-like testing pipelines, meaning repeated automated checks before and after model or software updates. Deployment decisions would rely on auditable evidence, not just a headline success score.
In the next year, privacy-aware GUI evaluation is likely to test agents on both full screenshots and sanitized screenshots. Sanitized screenshots hide sensitive regions, but this can also remove visual cues that help the agent find the right button, field, or menu item. Benchmarks will need to show whether failure comes from poor grounding, too much redaction, too little redaction, or confusion after the interface changes. By 36 months, this path points to dynamic protected-execution tests. A protected execution trace records what the agent saw, what was hidden, why information was shown or withheld, what action it took, and whether it recovered from masking or UI drift. Progress would be measured as a tradeoff among task success, number of steps, privacy exposure, grounding accuracy, and robustness. In applications, agents would be deployed only in defined workflow classes, with visibility budgets, privacy detectors, partial-view grounding modules, and audit trails that make screen use reviewable.
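One way to separate grounding failures from redaction failures is to compare the redaction masks with the target element and with the known sensitive regions. The sketch below is an assumption-laden illustration, not a method from the cited papers: boxes are axis-aligned rectangles, masks are assumed not to overlap each other, and the 0.5 threshold and the names `diagnose_step`, `overlap_area`, and `covers` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom (right/bottom exclusive)


def area(b: Box) -> int:
    return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)


def overlap_area(a: Box, b: Box) -> int:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def covers(mask: Box, region: Box) -> bool:
    """True when the mask fully contains the region."""
    return (mask[0] <= region[0] and mask[1] <= region[1]
            and mask[2] >= region[2] and mask[3] >= region[3])


def diagnose_step(
    target: Box,
    click: Tuple[int, int],
    masks: Sequence[Box],
    sensitive: Sequence[Box],
    hidden_threshold: float = 0.5,  # assumed cut-off for "the target was mostly hidden"
) -> Dict[str, object]:
    if area(target) == 0:
        raise ValueError("target box must have positive area")
    # Assumes masks do not overlap each other; otherwise the sum double-counts.
    hidden = min(1.0, sum(overlap_area(target, m) for m in masks) / area(target))
    on_target = target[0] <= click[0] < target[2] and target[1] <= click[1] < target[3]
    exposed = sum(1 for s in sensitive if not any(covers(m, s) for m in masks))
    if on_target:
        cause = "none"
    elif hidden >= hidden_threshold:
        cause = "over_redaction"   # the mask likely removed the cue the agent needed
    else:
        cause = "grounding"        # target was visible; the agent still missed it
    return {"cause": cause, "hidden_fraction": hidden, "exposed_sensitive": exposed}


result = diagnose_step(
    target=(100, 100, 200, 140),
    click=(300, 300),
    masks=[(90, 90, 180, 150)],
    sensitive=[(95, 95, 170, 145), (400, 10, 480, 30)],
)
print(result)
```

Reporting `exposed_sensitive` alongside the failure cause keeps under-redaction visible even on steps where the click itself succeeded.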
References
- WinClick: GUI Grounding with Multimodal Large Language Models - Authors: Zheng Hui, Yinheng Li, Dan Zhao, Tianyi Chen, Colby Banbury, Kazuhito Koishida / License: CC-BY-4.0
- MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents - Authors: Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang / License: CC-BY-4.0