GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents
- URL: http://arxiv.org/abs/2511.04307v2
- Date: Mon, 10 Nov 2025 12:27:15 GMT
- Title: GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents
- Authors: Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
- Abstract summary: GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space.
- Score: 59.107657859025586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and are constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset is publicly available at https://huggingface.co/datasets/vyokky/GUI-360.
Related papers
- ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present a trajectory expansion framework, Anchor, that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
- GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents [39.807839972627015]
We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. We introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples.
arXiv Detail & Related papers (2026-01-14T14:27:28Z)
- ProBench: Benchmarking GUI Agents with Accurate Process Information [15.519853892615272]
We introduce ProBench, a comprehensive benchmark with over 200 challenging GUI tasks covering widely used scenarios. We extend our dataset to include process-related tasks and design a specialized evaluation method. Our evaluation of advanced GUI agents reveals significant limitations in real-world GUI scenarios.
arXiv Detail & Related papers (2025-11-12T09:49:31Z)
- AUTO-Explorer: Automated Data Collection for GUI Agent [58.58097564914626]
We propose an automated data collection method with minimal annotation costs, named Auto-Explorer. It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments. Using the data gathered, we fine-tune a multimodal large language model (MLLM) and establish a GUI element grounding testing set.
arXiv Detail & Related papers (2025-11-09T15:13:45Z)
- GUIrilla: A Scalable Framework for Automated Desktop UI Exploration [0.0]
GUIrilla is an automated framework that explores applications via native accessibility APIs to address the critical data collection challenge in GUI automation. We construct and release GUIrilla-Task, a large-scale dataset of 27,171 functionally grounded tasks across 1,108 applications. Tuning LLM-based agents on GUIrilla-Task significantly improves performance on downstream UI tasks, outperforming synthetic baselines on the ScreenSpot Pro benchmark while using 97% less data.
arXiv Detail & Related papers (2025-10-16T19:03:45Z)
- UIPro: Unleashing Superior Interaction Capability For GUI Agents [33.77980648230746]
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). This paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data.
arXiv Detail & Related papers (2025-09-22T03:04:53Z)
- GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning [11.909652592163896]
GUI-ReWalk is a multi-stage framework for synthesizing realistic and diverse GUI trajectories. By combining randomness with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent.
arXiv Detail & Related papers (2025-09-19T08:09:18Z)
- MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in an online environment. It synthesizes a curriculum of learnable tasks through self-exploration and filtering. It adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z)
- GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies [34.63675989928621]
We introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation. We also propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools. This paradigm reduces annotation time cost by a factor of over 19. We assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios.
arXiv Detail & Related papers (2025-06-17T12:50:35Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.