UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
- URL: http://arxiv.org/abs/2505.21496v1
- Date: Tue, 27 May 2025 17:58:06 GMT
- Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
- Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
- Abstract summary: We introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcomes is challenging, and high-quality training data are not scalable. We show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks.
- Score: 37.871793585090586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcomes is challenging, and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands the set of solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the models, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.
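The abstract describes a reward-guided exploration and outcome-verification loop, so a minimal Python sketch of one such self-improvement iteration may help make the idea concrete. All names here (Step, Trajectory, agent.act, reward_model.score_action, reward_model.score_task, env.reset, env.step, the acceptance threshold) are illustrative assumptions, not the paper's actual API.

```python
# Illustrative sketch of one reward-guided self-improvement iteration, as
# described in the abstract. All class/method names are hypothetical and
# do NOT reflect the UI-Genie codebase.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    screenshot: bytes           # raw screen capture fed to the MLLM
    action: str                 # e.g. 'tap(120, 340)' or 'type("hello")'
    action_reward: float = 0.0  # action-level score from the reward model


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    task_reward: float = 0.0    # task-level (outcome) score


def self_improve_iteration(agent, reward_model, env, tasks,
                           n_rollouts=8, accept_threshold=0.9):
    """One generation of data-model self-improvement (hypothetical sketch).

    The agent explores each task several times; the reward model scores
    individual actions and the final outcome, and only high-reward
    trajectories are kept as new training data.
    """
    accepted: List[Trajectory] = []
    for task in tasks:
        for _ in range(n_rollouts):
            traj = Trajectory(task=task)
            obs = env.reset(task)
            done = False
            while not done:
                action = agent.act(task, obs, history=traj.steps)
                step = Step(screenshot=obs, action=action)
                # Action-level reward guides exploration at each step.
                step.action_reward = reward_model.score_action(task, traj.steps, step)
                obs, done = env.step(action)
                traj.steps.append(step)
            # Task-level reward verifies the trajectory outcome.
            traj.task_reward = reward_model.score_task(task, traj.steps)
            if traj.task_reward >= accept_threshold:
                accepted.append(traj)
    # Accepted trajectories would then fine-tune both the agent and the reward model.
    return accepted
```

Under this reading, the accepted high-reward trajectories become new training data for both the agent and the reward model, and repeating the loop over progressively harder tasks corresponds to the "three generations of data-model self-improvement" mentioned above.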
Related papers
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding [14.42815380175443]
We introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning process. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays.
arXiv Detail & Related papers (2025-07-29T17:22:07Z) - MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in online environments. It synthesizes a curriculum of learnable tasks through self-exploration and filtering, and adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
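GRPO normalizes rewards within a group of rollouts for the same task; the following is a minimal sketch, assuming a trajectory-aware variant that broadcasts each rollout's group-normalized composite reward to all of its steps. The function name and arguments are hypothetical and do not reflect MobileGUI-RL's actual implementation.

```python
import numpy as np


def trajectory_grpo_advantages(group_rewards, step_counts):
    """Hypothetical sketch: GRPO-style advantages broadcast over trajectories.

    `group_rewards` holds one composite reward per rollout of the same task;
    each rollout's group-normalized advantage is assigned to every step in
    that rollout (trajectory-aware credit assignment).
    """
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)   # normalize within the group
    # Repeat each rollout's advantage for all of its steps.
    return [np.full(n, a) for a, n in zip(adv, step_counts)]
```

For example, trajectory_grpo_advantages([1.0, 0.0, 1.0, 0.5], [12, 7, 9, 15]) would return one per-step advantage array for each of the four rollouts.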
arXiv Detail & Related papers (2025-07-08T07:07:53Z) - Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents [15.303188467166752]
We present CogniGUI, a cognitive framework developed to overcome these limitations by enabling adaptive learning for GUI automation that resembles human-like behavior. To assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi-application navigation, dynamic state transitions, and cross-interface coherence. Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods on both current GUI grounding benchmarks and our newly proposed benchmark.
arXiv Detail & Related papers (2025-06-22T06:30:52Z) - GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent [66.34801160469067]
MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents.
arXiv Detail & Related papers (2025-05-22T16:01:06Z) - UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis [15.429065788185522]
We introduce UI-E2I-Synth, a large-scale data synthesis pipeline for generating instruction datasets of varying complexity. We also propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding.
arXiv Detail & Related papers (2025-04-15T14:56:21Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
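Reverse task synthesis, as summarized here, inverts the usual pipeline: the agent interacts first and an instruction is derived afterward from the observed trajectory. Below is a minimal sketch under that assumption; the explorer and labeler objects and their methods are hypothetical, not OS-Genesis's actual implementation.

```python
# Illustrative sketch of "reverse task synthesis": interact first, then
# derive an instruction from what was observed. All names are hypothetical.
from typing import List, Tuple


def reverse_task_synthesis(env, explorer, labeler, n_episodes=100,
                           max_steps=20) -> List[Tuple[str, list]]:
    """Collect task-free interaction traces, then label each trace
    retroactively with a synthesized instruction (hypothetical sketch)."""
    dataset = []
    for _ in range(n_episodes):
        obs = env.reset()
        trace = []
        for _ in range(max_steps):
            action = explorer.propose(obs)      # exploratory, task-free action
            next_obs = env.step(action)
            trace.append((obs, action, next_obs))
            obs = next_obs
        instruction = labeler.describe(trace)   # e.g. an MLLM summarizing the trace
        dataset.append((instruction, trace))
    return dataset
```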
arXiv Detail & Related papers (2024-12-27T16:21:58Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features several innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)