See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
- URL: http://arxiv.org/abs/2602.10814v1
- Date: Wed, 11 Feb 2026 12:54:53 GMT
- Title: See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
- Authors: Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang,
- Abstract summary: We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs.
- Score: 20.231957791642635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning-acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
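The distinction between the two interaction modes, and the execution-based check that follows construction, can be pictured with a short sketch. Everything below is a hypothetical illustration assumed for this summary: the action classes, the ScratchEnv interface, and the evaluate() helper are invented names, not the benchmark's actual API.

```python
# A minimal, illustrative sketch (not the actual ScratchWorld API): the action
# classes, environment interface, and evaluate() helper are hypothetical names
# used only to contrast the two interaction modes described in the abstract.
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass
class DragAndDrop:
    """Primitive-mode action: pixel-level drag-and-drop, testing visuomotor control."""
    start_x: int
    start_y: int
    end_x: int
    end_y: int


@dataclass
class AttachBlock:
    """Composite-mode action: a high-level semantic edit that hides GUI execution."""
    opcode: str           # e.g. "motion_movesteps" (hypothetical block identifier)
    parent_block_id: str  # block beneath which the new block is attached


class ScratchEnv(Protocol):
    """Hypothetical browser-backed environment accepting either action type."""
    def step(self, action: object) -> dict: ...
    def run_project(self) -> dict: ...  # executes the constructed program in the browser


def evaluate(env: ScratchEnv, checks: Iterable[Callable[[dict], bool]]) -> bool:
    """Execution-based evaluation: run the constructed Scratch program and apply
    functional-correctness checks to its runtime trace (a sketch of the protocol
    the abstract describes, not its implementation)."""
    trace = env.run_project()
    return all(check(trace) for check in checks)
```

Under this framing, a task that fails in primitive mode but succeeds in composite mode points to visuomotor control rather than program reasoning, which is the disentanglement the two modes are designed to provide.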
Related papers
- D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies [39.738017374978796]
We propose D-GARA, a benchmarking framework to evaluate Android GUI agent robustness in real-world anomalies. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments.
arXiv Detail & Related papers (2025-11-20T17:43:46Z)
- ProBench: Benchmarking GUI Agents with Accurate Process Information [15.519853892615272]
We introduce ProBench, a comprehensive benchmark with over 200 challenging GUI tasks covering widely-used scenarios. We extend our dataset to include Process-related Tasks and design a specialized evaluation method. Our evaluation of advanced GUI agents reveals significant limitations in real-world GUI scenarios.
arXiv Detail & Related papers (2025-11-12T09:49:31Z)
- MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z)
- MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning [83.81404871748438]
MagicGUI is a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by six key components, including a comprehensive and accurate dataset, enhanced perception and grounding capabilities, a comprehensive and unified action space, and planning-oriented reasoning mechanisms.
arXiv Detail & Related papers (2025-07-19T12:33:43Z)
- Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents [15.303188467166752]
We present CogniGUI, a cognitive framework developed to overcome existing limitations by enabling adaptive, human-like learning for GUI automation. To assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi-application navigation, dynamic state transitions, and cross-interface coherence. Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods on both current GUI grounding benchmarks and our newly proposed benchmark.
arXiv Detail & Related papers (2025-06-22T06:30:52Z)
- Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution. We develop a reasoning-bootstrapping-based data collection pipeline to create the GUI-Critic-Train and GUI-Critic-Test datasets. Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z)
- AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z)
- A Survey on (M)LLM-Based GUI Agents [62.57899977018417]
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction. Recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. This survey identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control.
arXiv Detail & Related papers (2025-03-27T17:58:31Z)
- GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent [24.97846085313314]
We propose a formalized and comprehensive environment to evaluate the entire process of automated GUI testing. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection. The environment evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data.
arXiv Detail & Related papers (2024-12-24T13:41:47Z)
- GUI Agents: A Survey [159.7656453000263]
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
arXiv Detail & Related papers (2024-12-18T04:48:28Z)
- De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z)