SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis
- URL: http://arxiv.org/abs/2601.18305v1
- Date: Mon, 26 Jan 2026 09:35:10 GMT
- Title: SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis
- Authors: Xuan Wang, Siyuan Su, Quantong Fu, Yongxiang Hu, Yangfan Zhou
- Abstract summary: We decompose human swipe gestures into quantifiable dimensions and propose SwipeGen, an automated pipeline that synthesizes human-like swipe interactions. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. We also propose GUISwiper, a GUI agent with enhanced interaction execution capabilities.
- Score: 11.291868789244496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception so that task instructions can be grounded into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately replicating human-like behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose SwipeGen, an automated pipeline that synthesizes human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, a 214% improvement over existing VLM baselines.
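Since the abstract does not enumerate the gesture dimensions, the following sketch is only an illustration of the idea: a swipe is parameterized by anchor points, duration, path curvature, and a velocity profile, then sampled into touch-move events. The names (`TouchEvent`, `synthesize_swipe`) and the specific easing and curvature choices are assumptions, not SwipeGen's actual pipeline.

```python
import math
from dataclasses import dataclass

@dataclass
class TouchEvent:
    t_ms: float  # timestamp offset from gesture start
    x: float
    y: float

def synthesize_swipe(start, end, duration_ms=300.0, n_points=20,
                     curvature=0.05, ease=2.0):
    """Sample a human-like swipe as a sequence of touch-move events.

    Dimensions (illustrative): start/end anchors, total duration,
    lateral curvature of the path, and an ease-in/ease-out velocity
    profile instead of the constant-speed straight line most GUI
    agents emit.
    """
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    # Unit normal to the straight-line path, used to bow the trajectory.
    nx, ny = (-dy / length, dx / length) if length else (0.0, 0.0)
    events = []
    for i in range(n_points + 1):
        u = i / n_points                         # uniform progress in [0, 1]
        s = u**ease / (u**ease + (1 - u)**ease)  # smooth ease-in/ease-out
        bow = curvature * length * math.sin(math.pi * s)
        events.append(TouchEvent(t_ms=u * duration_ms,
                                 x=x0 + s * dx + bow * nx,
                                 y=y0 + s * dy + bow * ny))
    return events

# Example: a 300 ms upward scroll on a 1080x2400 screen.
for ev in synthesize_swipe((540, 1800), (540, 600))[:3]:
    print(f"{ev.t_ms:6.1f} ms -> ({ev.x:7.1f}, {ev.y:7.1f})")
```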
Related papers
- ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands [59.222064425122795]
We develop ShowUI-$π$, the first flow-based generative model serving as a GUI dexterous hand. ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach.
arXiv Detail & Related papers (2025-12-31T16:51:14Z)
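ShowUI-$π$ above frames action generation as a flow-based generative model. As rough intuition only, and not the paper's architecture, a generic flow-matching sampler Euler-integrates a learned velocity field from noise to a pointer coordinate; `velocity_field` below is a hypothetical stand-in for the trained network.

```python
import torch

def flow_sample(velocity_field, cond, action_dim=2, steps=10):
    """Euler-integrate a learned flow from noise to a pointer action.

    Flow matching moves a noise sample x_0 ~ N(0, I) along
    dx/dt = v(x_t, t, cond) to an action at t = 1.
    """
    x = torch.randn(action_dim)          # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor(i * dt)
        x = x + dt * velocity_field(x, t, cond)  # one Euler step
    return x  # e.g., a normalized (x, y) screen coordinate

# Toy stand-in field: drift every sample toward a conditioning target.
target = torch.tensor([0.25, 0.75])
print(flow_sample(lambda x, t, c: c - x, cond=target))
```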
- UIPro: Unleashing Superior Interaction Capability For GUI Agents [33.77980648230746]
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Existing methods have tried to develop GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). This paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data.
arXiv Detail & Related papers (2025-09-22T03:04:53Z)
- GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning [11.909652592163896]
GUI-ReWalk is a multi-stage framework for synthesizing realistic and diverse GUI trajectories. By combining stochastic exploration with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent.
arXiv Detail & Related papers (2025-09-19T08:09:18Z)
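To make "stochastic exploration" and "trajectory entropy" concrete, here is a toy sketch: a random walk over a hand-written screen graph generates trajectories, and Shannon entropy over the sampled actions serves as a crude diversity proxy. The `GUI` graph and the entropy definition are illustrative assumptions; GUI-ReWalk's intent-aware reasoning stage is not modeled.

```python
import math
import random
from collections import Counter

# Hypothetical GUI model: screen -> list of (action, next_screen).
GUI = {
    "home":    [("tap:search", "search"), ("swipe:up", "feed")],
    "search":  [("type:query", "results"), ("tap:back", "home")],
    "feed":    [("swipe:up", "feed"), ("tap:item", "detail")],
    "results": [("tap:item", "detail"), ("tap:back", "home")],
    "detail":  [("tap:back", "home")],
}

def random_walk(start="home", max_steps=6, rng=random):
    """Stochastic exploration: sample one GUI trajectory."""
    screen, trajectory = start, []
    for _ in range(max_steps):
        action, screen = rng.choice(GUI[screen])
        trajectory.append(action)
    return trajectory

def action_entropy(trajectories):
    """Shannon entropy (bits) of the action distribution; a rough
    proxy for the 'trajectory entropy' the summary refers to."""
    counts = Counter(a for t in trajectories for a in t)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

walks = [random_walk() for _ in range(200)]
print(f"{action_entropy(walks):.2f} bits over {len(walks)} trajectories")
```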
- GTA1: GUI Test-time Scaling Agent [97.58177633084915]
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals. This paper investigates the challenges that arise in this setting with our GUI Test-time Scaling Agent, GTA1.
arXiv Detail & Related papers (2025-07-08T08:52:18Z)
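Test-time scaling for action selection is commonly realized as best-of-N sampling with a judge model; the sketch below shows that generic pattern, not necessarily GTA1's exact scheme. `propose` and `judge` are hypothetical model calls.

```python
import random
from typing import Callable, List

def best_of_n(propose: Callable[[str], str],
              judge: Callable[[str, str], float],
              instruction: str, n: int = 8) -> str:
    """Test-time scaling: sample n candidate action proposals and
    keep the one a judge model scores highest."""
    candidates: List[str] = [propose(instruction) for _ in range(n)]
    return max(candidates, key=lambda a: judge(instruction, a))

# Toy stand-ins: a noisy proposer and a judge that prefers clicks.
actions = ["click(login_button)", "scroll(down)", "type('hello')"]
pick = best_of_n(lambda _: random.choice(actions),
                 lambda _, a: 1.0 if a.startswith("click") else 0.0,
                 "log in to the site")
print(pick)
```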
- MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in an online environment. It synthesizes a curriculum of learnable tasks through self-exploration and filtering, and adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z)
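A minimal sketch of the two ingredients named in the summary, under stated assumptions: a composite reward mixing task success with an efficiency bonus (the paper's actual reward terms are not given here), and GRPO's group-relative advantage, which normalizes each rollout's reward against the group sampled for the same task.

```python
import statistics

def composite_reward(success: bool, n_steps: int, max_steps: int = 20) -> float:
    """Illustrative composite reward: task success plus a small
    efficiency bonus for shorter trajectories."""
    return float(success) + 0.1 * (max_steps - n_steps) / max_steps

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: standardize each rollout's
    reward within its group of rollouts for the same task."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mu) / sigma for r in rewards]

# Four rollouts of the same task: two succeed at different lengths.
group = [composite_reward(True, 5), composite_reward(True, 12),
         composite_reward(False, 20), composite_reward(False, 20)]
print([f"{a:+.2f}" for a in grpo_advantages(group)])
```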
- Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to actual execution. We develop a reasoning-bootstrapping-based data collection pipeline to create the GUI-Critic-Train and GUI-Critic-Test datasets. Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z)
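The pre-operative critic idea can be summarized as a gate in the agent loop: query a critic model before executing, and replan on rejection. The sketch below is that generic pattern with hypothetical stand-in callables, not the GUI-Critic-R1 model itself.

```python
from typing import Callable, Tuple

def guarded_step(screen: str, action: str,
                 critic: Callable[[str, str], Tuple[bool, str]],
                 execute: Callable[[str], None],
                 replan: Callable[[str, str], str]) -> str:
    """Pre-operative critique: ask a critic whether the proposed
    action is safe/correct *before* executing it, and replan once
    on rejection."""
    ok, feedback = critic(screen, action)
    if not ok:
        action = replan(screen, feedback)  # revise using the critique
    execute(action)
    return action

# Toy stand-ins: reject taps on a destructive 'Delete all' button.
taken = guarded_step(
    screen="settings_page",
    action="tap('Delete all')",
    critic=lambda s, a: ("Delete" not in a, "destructive action"),
    execute=lambda a: print("executing:", a),
    replan=lambda s, fb: "tap('Back')",
)
```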
- AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
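The "screenshots-only" design reduces, at inference time, to a simple perception-action loop; the sketch below shows that loop shape under assumptions, with hypothetical stand-ins for the model, screenshot capture, and action dispatch.

```python
from typing import Callable

def run_episode(task: str,
                screenshot: Callable[[], bytes],
                model: Callable[[str, bytes], str],
                act: Callable[[str], None],
                max_steps: int = 15) -> None:
    """Native-agent loop in the spirit of the summary: the model sees
    only the task text and the current screenshot, and emits one
    action per step."""
    for _ in range(max_steps):
        action = model(task, screenshot())   # vision-only perception
        if action == "DONE":
            break
        act(action)                          # e.g., dispatch via adb

# Toy stand-ins that finish after one click.
steps = iter(["click(200, 640)", "DONE"])
run_episode("open the mailbox",
            screenshot=lambda: b"\x89PNG...",
            model=lambda t, s: next(steps),
            act=lambda a: print("step:", a))
```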
- Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation [53.1000575179389]
We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism. In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation. Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, leads to more effective GUI generation.
arXiv Detail & Related papers (2024-12-15T22:17:30Z)
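Self-Critique (SCGG) can be pictured as a generate-critique-regenerate loop with a single LLM; the sketch below shows that loop with toy stand-ins, and the prompt wording is an assumption rather than the paper's template.

```python
from typing import Callable

def self_critique_generate(prompt: str,
                           generate: Callable[[str], str],
                           critique: Callable[[str, str], str],
                           rounds: int = 2) -> str:
    """SCGG-style sketch: draft a GUI spec, ask the same LLM to
    critique it, then regenerate conditioned on the critique."""
    gui = generate(prompt)
    for _ in range(rounds):
        feedback = critique(prompt, gui)
        gui = generate(f"{prompt}\nPrevious draft:\n{gui}\n"
                       f"Reviewer feedback:\n{feedback}\nRevise accordingly.")
    return gui

# Toy stand-ins: the 'critique' always asks for a submit button.
result = self_critique_generate(
    "login form for a mobile app",
    generate=lambda p: "<form>user, password" +
                       (", submit</form>" if "submit" in p else "</form>"),
    critique=lambda p, g: "add a submit button" if "submit" not in g else "ok",
)
print(result)
```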
- You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface. We propose a chain-of-action technique to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark, AITW, with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z)
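Chain-of-action prompting is usually described as conditioning the next action on the history of executed actions plus a sketched plan of remaining ones; the snippet below illustrates that pattern with a hypothetical prompt format and a toy model stand-in, not Auto-GUI's actual interface.

```python
from typing import Callable, List

def chain_of_action_step(goal: str, screen: str, history: List[str],
                         model: Callable[[str], str]) -> str:
    """Chain-of-action sketch: expose prior actions, ask for a short
    plan of remaining actions, then commit to the next one."""
    prompt = (f"Goal: {goal}\nScreen: {screen}\n"
              f"Previous actions: {history or 'none'}\n"
              "First list the remaining actions, then output the next "
              "one as NEXT: <action>.")
    reply = model(prompt)
    action = reply.rsplit("NEXT:", 1)[-1].strip()
    history.append(action)
    return action

# Toy stand-in model that always taps the search box.
hist: List[str] = []
nxt = chain_of_action_step("find cheap flights", "airline home screen",
                           hist, lambda p: "1. tap search 2. type query\n"
                                           "NEXT: tap(search_box)")
print(nxt, hist)
```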