Related papers: GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

URL: http://arxiv.org/abs/2509.15738v1
Date: Fri, 19 Sep 2025 08:09:18 GMT
Title: GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning
Authors: Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li,
Abstract summary: GUI-ReWalk is a multi-stage framework for synthesizing realistic and diverse GUI trajectories. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent.
Score: 11.909652592163896
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation methods that trade off between diversity and meaningful task coverage. To bridge this gap, we present GUI-ReWalk: a reasoning-enhanced, multi-stage framework for synthesizing realistic and diverse GUI trajectories. GUI-ReWalk begins with a stochastic exploration phase that emulates human trial-and-error behaviors, and progressively transitions into a reasoning-guided phase where inferred goals drive coherent and purposeful interactions. Moreover, it supports multi-stride task generation, enabling the construction of long-horizon workflows across multiple applications. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. We further train Qwen2.5-VL-7B on the GUI-ReWalk dataset and evaluate it across multiple benchmarks, including Screenspot-Pro, OSWorld-G, UI-Vision, AndroidControl, and GUI-Odyssey. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent. These findings establish GUI-ReWalk as a scalable and data-efficient framework for advancing GUI agent research and enabling robust real-world automation.

Related papers

ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands [59.222064425122795]
We develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand. ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach.
arXiv Detail & Related papers (2025-12-31T16:51:14Z)
History-Aware Reasoning for GUI Agents [15.519853892615272]
Current methods integrate Reinforcement Learning with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. We propose a History-Aware Reasoning framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge. We develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware.
arXiv Detail & Related papers (2025-11-12T09:06:25Z)
GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z)
UIPro: Unleashing Superior Interaction Capability For GUI Agents [33.77980648230746]
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). This paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data.
arXiv Detail & Related papers (2025-09-22T03:04:53Z)
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds [21.902626737678286]
Multimodal large language models are evolving toward multimodal agents capable of proactively executing tasks. Most agent research focuses on GUI or embodied scenarios, which correspond to agents interacting with 2D virtual worlds or 3D real worlds, respectively. We propose a high-performance generalist agent OmniActor, designed from both structural and data perspectives.
arXiv Detail & Related papers (2025-09-02T13:47:54Z)
MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning [83.81404871748438]
MagicGUI is a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by six key components, including a comprehensive and accurate dataset, enhanced perception and grounding capabilities, a comprehensive and unified action space, and planning-oriented reasoning mechanisms.
arXiv Detail & Related papers (2025-07-19T12:33:43Z)
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in an online environment. It synthesizes a curriculum of learnable tasks through self-exploration and filtering. It adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z)
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent [66.34801160469067]
MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents.
arXiv Detail & Related papers (2025-05-22T16:01:06Z)
SpiritSight Agent: Advanced GUI Agent with One Look [7.470506991479107]
An ideal Graphical User Interface (GUI) agent is expected to achieve high accuracy, low latency, and compatibility. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). We propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks.
arXiv Detail & Related papers (2025-03-05T05:30:22Z)
UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.