ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
- URL: http://arxiv.org/abs/2512.24965v1
- Date: Wed, 31 Dec 2025 16:51:14 GMT
- Title: ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
- Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
- Abstract summary: We develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand. ShowUI-$π$ achieves 26.98 on ScreenDrag with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
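The flow-based action generation described in the abstract can be sketched as a minimal flow-matching loop: a velocity field over cursor offsets is supervised with linear-interpolation targets, and at inference the drag action is produced by Euler-integrating that field from noise. This is an illustrative sketch only; the function names, the toy velocity field, and the two-dimensional action space are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def flow_matching_target(a0, a1, t):
    """Linear path a_t between noise a0 and data a1, and its velocity target."""
    a_t = (1.0 - t) * a0 + t * a1
    v_target = a1 - a0            # constant velocity along the straight path
    return a_t, v_target

def sample_action(v_field, obs, dim=2, steps=10, rng=None):
    """Euler-integrate a learned velocity field from noise to an action."""
    rng = rng if rng is not None else np.random.default_rng(0)
    a = rng.standard_normal(dim)  # a_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + dt * v_field(a, t, obs)   # a_{t+dt} = a_t + dt * v(a_t, t, obs)
    return a

# Toy velocity field standing in for the learned action expert: it ignores
# the observation and flows toward a fixed 2D cursor offset.
target = np.array([0.3, 0.7])
v_toy = lambda a, t, obs: (target - a) / max(1.0 - t, 1e-6)

a_final = sample_action(v_toy, obs=None)  # drifts onto `target` by t = 1
```

In the actual model, `v_field` would be conditioned on the continuous visual observation `obs`, which is what enables the closed-loop, on-the-fly adjustment the abstract describes.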
Related papers
- SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis [11.291868789244496]
We decompose human swipe gestures into quantifiable dimensions and propose SwipeGen, an automated pipeline that synthesizes human-like swipe interactions. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. We also propose GUISwiper, a GUI agent with enhanced interaction execution capabilities.
arXiv Detail & Related papers (2026-01-26T09:35:10Z) - Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging [21.57463393334841]
Dragging the mouse to select and manipulate textual content is a prevalent and important operation in practical GUI scenarios. We introduce GUI-Drag, a dataset of 161K text dragging examples synthesized through a scalable pipeline. To support systematic and robust evaluation, we construct ScreenDrag, a benchmark with 5,333 examples spanning three levels of interface context.
arXiv Detail & Related papers (2025-11-07T19:40:09Z) - GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z) - GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning [11.909652592163896]
GUI-ReWalk is a multi-stage framework for synthesizing realistic and diverse GUI trajectories. By combining stochastic exploration with intent-aware reasoning, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent.
arXiv Detail & Related papers (2025-09-19T08:09:18Z) - GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding [51.497245303008015]
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Motivated by human clicking behavior, which naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards. We show that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro.
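The core idea behind a Gaussian grounding reward can be sketched in a few lines: instead of a binary hit/miss signal for a predicted click, the reward decays smoothly with distance from the target element's center. This is a simplified illustration of the general idea only; the function name and the single-isotropic-Gaussian form are assumptions, not the paper's exact formulation.

```python
import math

def gaussian_point_reward(pred, center, sigma):
    """Dense click reward: peaks at 1.0 on the element center and decays
    smoothly with squared distance, rather than a 0/1 hit-box reward."""
    dx, dy = pred[0] - center[0], pred[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

# A click 2 px off-center keeps most of the reward; a click 30 px away
# (for sigma = 10) receives much less, giving the policy a useful gradient.
near = gaussian_point_reward((52, 50), (50, 50), sigma=10.0)
far = gaussian_point_reward((80, 50), (50, 50), sigma=10.0)
```

A dense reward of this shape is what lets reinforcement fine-tuning distinguish "almost right" from "completely wrong" clicks, which a binary bounding-box reward cannot.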
arXiv Detail & Related papers (2025-07-21T17:53:42Z) - MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning [83.81404871748438]
MagicGUI is a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by six key components, including a comprehensive and accurate dataset, enhanced perception and grounding capabilities, a comprehensive and unified action space, and planning-oriented reasoning mechanisms.
arXiv Detail & Related papers (2025-07-19T12:33:43Z) - MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in online environments. It synthesizes a curriculum of learnable tasks through self-exploration and filtering, and adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z) - UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents [37.871793585090586]
We introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verifying trajectory outcomes is difficult, and high-quality training data are not scalable. We show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks.
arXiv Detail & Related papers (2025-05-27T17:58:06Z) - GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents [13.415165482033395]
Out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of GUI agents may cause task breakdowns or pose security threats. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. We propose GEM, a novel method that fits a Gaussian mixture model over input embedding distances extracted from the GUI agent, reflecting its capability boundary.
arXiv Detail & Related papers (2025-05-19T08:29:05Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue, achieving state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.