D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
- URL: http://arxiv.org/abs/2511.16590v1
- Date: Thu, 20 Nov 2025 17:43:46 GMT
- Title: D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
- Authors: Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang,
- Abstract summary: We propose D-GARA, a benchmarking framework to evaluate Android GUI agent robustness in real-world anomalies.<n>Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies.<n> Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments.
- Score: 39.738017374978796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
Related papers
- See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch [20.231957791642635]
We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch.<n>ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute.<n>To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs.
arXiv Detail & Related papers (2026-02-11T12:54:53Z) - GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation.<n>GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality.<n>Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z) - Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance.<n>This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization.<n> Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0%$ and $3.7%$ accuracy respectively when used independently, their integration within GMS framework yields an overall accuracy of $35.7%$.
arXiv Detail & Related papers (2025-09-29T00:06:31Z) - UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence.<n>We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology.<n> Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z) - Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents [15.303188467166752]
We present CogniGUI, a cognitive framework developed to overcome limitations by enabling adaptive learning for GUI automation resembling human-like behavior.<n>To assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi application navigation, dynamic state transitions, and cross interface coherence.<n> Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods in both the current GUI grounding benchmarks and our newly proposed benchmark.
arXiv Detail & Related papers (2025-06-22T06:30:52Z) - GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies [34.63675989928621]
We introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation.<n>We also propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools.<n>This paradigm significantly reduces annotation time cost by a factor of over 19 times.<n>We assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios.
arXiv Detail & Related papers (2025-06-17T12:50:35Z) - AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction.<n>Our training pipeline includes grounding-aware pre-training to enhance perception.<n>AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z) - On the Robustness of GUI Grounding Models Against Image Attacks [32.731293426828785]
We systematically evaluate the robustness of state-of-the-art GUI grounding models, such as UGround, under three conditions.<n>Our experiments have clearly demonstrated that GUI grounding models exhibit a high degree of sensitivity to adversarial perturbations and low-resolution conditions.
arXiv Detail & Related papers (2025-04-07T03:58:45Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL)<n>Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.<n>These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.