UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
- URL: http://arxiv.org/abs/2503.21620v4
- Date: Wed, 14 May 2025 11:56:07 GMT
- Title: UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
- Authors: Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li
- Abstract summary: We propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices.
- Score: 31.796328505473305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.
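To make the recipe concrete, here is a minimal sketch of a rule-based action reward combined with GRPO-style group-relative advantages. The reward terms, tag format, and helper names (e.g. `rule_based_reward`, `grpo_advantages`) are illustrative assumptions, not the exact implementation released in the UI-R1 repository.

```python
import re
import numpy as np

def rule_based_reward(response: str, gt_action: str, gt_bbox: tuple) -> float:
    """Score one sampled response with simple, verifiable rules (illustrative, not the released reward)."""
    reward = 0.0
    # Format reward: reasoning and answer must be wrapped in <think>/<answer> tags.
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S):
        reward += 1.0
    answer = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if not answer:
        return reward
    # Action-type reward: the predicted action type must match the ground truth (e.g. "click").
    if gt_action in answer.group(1):
        reward += 1.0
        # Coordinate reward: for clicks, the predicted point must fall inside the target box.
        coords = re.findall(r"\((\d+),\s*(\d+)\)", answer.group(1))
        if coords:
            x, y = map(int, coords[0])
            x1, y1, x2, y2 = gt_bbox
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0
    return reward

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: four responses sampled for one GUI task whose target box is (100, 200, 180, 260).
responses = [
    "<think>The button is near the top.</think><answer>click (150, 230)</answer>",
    "<think>Maybe scroll first.</think><answer>scroll down</answer>",
    "<answer>click (150, 230)</answer>",                                  # missing <think> tag
    "<think>Tap the icon.</think><answer>click (400, 500)</answer>",      # wrong location
]
group = [rule_based_reward(resp, "click", (100, 200, 180, 260)) for resp in responses]
print(grpo_advantages(group))  # highest advantage for the correct, well-formatted response
```

In GRPO, several responses are sampled per task and each response's reward is normalized against the group mean and standard deviation, so no learned critic is needed; a verifiable rule-based reward like the one above is what supplies that signal for GUI actions.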
Related papers
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs. We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z) - UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents [37.871793585090586]
We introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcomes is challenging, and high-quality training data are not scalable. We show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks.
arXiv Detail & Related papers (2025-05-27T17:58:06Z) - ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay [88.74638385288773]
Agentic Replay Policy Optimization improves performance on complex, long-horizon computer tasks. We propose a task selection strategy that filters tasks based on baseline agent performance. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results.
arXiv Detail & Related papers (2025-05-22T06:24:32Z) - GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents [15.29032612749017]
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding. We first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro.
arXiv Detail & Related papers (2025-05-21T17:59:09Z) - Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability [16.87554947089102]
We propose ARENA, a transparent RAG generator framework trained via reinforcement learning (RL) with our proposed rewards. Based on the structured generation and adaptive reward calculation, our RL-based training enables the model to identify key evidence, perform structured reasoning, and generate answers with interpretable decision traces.
arXiv Detail & Related papers (2025-05-19T15:40:29Z) - Reinforcement Learning for Reasoning in Large Language Models with One Training Example [129.11571295790807]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs).
We identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated.
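For intuition, a minimal sketch of the verifiable-reward idea behind RLVR is shown below: the reward is 1 only when the completion's final answer can be checked against a known reference. The function name and boxed-answer convention are illustrative assumptions, not the paper's code.

```python
import re

def verifiable_math_reward(completion: str, gold_answer: str) -> float:
    """Binary, rule-checkable reward: 1.0 iff the boxed final answer matches the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# In 1-shot RLVR the policy repeatedly samples completions for the same single example
# and is updated toward completions that this checker accepts.
print(verifiable_math_reward(r"... so the final result is \boxed{42}.", "42"))  # 1.0
```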
arXiv Detail & Related papers (2025-04-29T09:24:30Z) - InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners [41.22438639369124]
InfiGUI-R1 is an MLLM-based GUI agent developed through our Actor2Reasoner framework.
We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs.
We refine the basic reasoner into a deliberative one using Reinforcement Learning.
arXiv Detail & Related papers (2025-04-19T09:25:55Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next token prediction is the fundamental principle for training large language models (LLMs).
We introduce R1-SGG, a multimodal LLM (M-LLM) trained via supervised fine-tuning (SFT) on the scene graph dataset.
We design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward.
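As a rough illustration of such a reward, the sketch below combines node-level recall, edge-level (triplet) recall, and a format bonus; the specific matching rule and equal weighting are assumptions, not the paper's exact design.

```python
def graph_reward(pred_nodes, pred_edges, gt_nodes, gt_edges, well_formatted):
    """Illustrative graph-centric reward: node recall + edge (triplet) recall + format bonus."""
    r_node = len(set(pred_nodes) & set(gt_nodes)) / max(len(gt_nodes), 1)   # node-level reward
    r_edge = len(set(pred_edges) & set(gt_edges)) / max(len(gt_edges), 1)   # edge-level reward
    r_format = 1.0 if well_formatted else 0.0                               # format-consistency reward
    return r_node + r_edge + r_format

# Example: two of three ground-truth objects and one of two relations are recovered.
print(graph_reward(
    pred_nodes=["man", "dog"],
    pred_edges=[("man", "holding", "leash")],
    gt_nodes=["man", "dog", "leash"],
    gt_edges=[("man", "holding", "leash"), ("dog", "on", "leash")],
    well_formatted=True,
))  # 2/3 + 1/2 + 1.0
```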
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [16.72683291432717]
GUI-R1 is the first reinforcement learning framework designed to enhance the capabilities of LVLMs in high-level real-world task scenarios.
GUI-R1 achieves superior performance using only 0.02% of the data compared to previous state-of-the-art methods like OS-Atlas.
arXiv Detail & Related papers (2025-04-14T17:45:54Z) - Improved Visual-Spatial Reasoning via R1-Zero-Like Training [18.61987706753493]
Video visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of multi-modal large language models (MLLMs).
This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training.
arXiv Detail & Related papers (2025-04-01T15:11:11Z) - EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Our approach only leverages a small number of samples to search for the desired pruning policy. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - Visual-RFT: Visual Reinforcement Fine-Tuning [75.20572976629646]
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers.
Visual-RFT further extends the application areas of RFT on visual tasks.
arXiv Detail & Related papers (2025-03-03T18:16:32Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.
In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - Improved GUI Grounding via Iterative Narrowing [0.03922370499388702]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
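A possible shape of such an iterative-narrowing loop is sketched below: the grounding model predicts a point, the screenshot is cropped around it, and the query is repeated on the smaller crop. The `predict_point` callable and the fixed halving schedule are placeholders, not the paper's implementation.

```python
from PIL import Image

def iterative_grounding(image: Image.Image, instruction: str, predict_point, steps: int = 3):
    """Illustrative iterative narrowing: re-query the model on progressively smaller crops."""
    left, top = 0, 0
    crop = image
    for _ in range(steps):
        x, y = predict_point(crop, instruction)  # model's predicted point, in crop coordinates
        # Center the next crop (half the current size) on the predicted point,
        # clamped so it stays inside the original screenshot.
        left = max(0, min(left + x - crop.width // 4, image.width - crop.width // 2))
        top = max(0, min(top + y - crop.height // 4, image.height - crop.height // 2))
        crop = image.crop((left, top, left + crop.width // 2, top + crop.height // 2))
    # Final answer: the center of the last crop, in full-image coordinates.
    return left + crop.width // 2, top + crop.height // 2
```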
arXiv Detail & Related papers (2024-11-18T05:47:12Z) - AutoGLM: Autonomous Foundation Agents for GUIs [51.276965515952]
We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs).
We have developed AutoGLM as a practical foundation agent system for real-world GUI interactions.
Our evaluations demonstrate AutoGLM's effectiveness across multiple domains.
arXiv Detail & Related papers (2024-10-28T17:05:10Z) - DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
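For reference, the step-level preference update can be written with the standard DPO objective; the tensors below stand for summed log-probabilities of preferred and dispreferred steps under the policy and the frozen reference model, and the inputs are dummy values for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective applied to step-level preference pairs (illustrative)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push preferred steps up and dispreferred steps down, relative to the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Dummy summed log-probabilities for a batch of two step-level preference pairs.
loss = dpo_loss(torch.tensor([-1.0, -2.0]), torch.tensor([-1.5, -2.5]),
                torch.tensor([-1.2, -2.1]), torch.tensor([-1.4, -2.4]))
print(float(loss))
```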
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Foundation Models for Structural Health Monitoring [17.37816294594306]
We propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for Structural Health Monitoring.
We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training.
We showcase the effectiveness of our foundation models using data from three operational viaducts.
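A compact sketch of masked-autoencoder-style self-supervised pre-training on sensor patches is given below; unlike the original MAE, this simplified version zeroes masked patches instead of dropping them from the encoder, and all sizes and the mask ratio are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal masked auto-encoder over fixed-length sensor patches (illustrative sketch)."""
    def __init__(self, patch_dim=64, d_model=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, patch_dim)

    def forward(self, patches):
        # Zero out a large random fraction of patches, then reconstruct the full sequence.
        mask = torch.rand(patches.shape[:2], device=patches.device) < self.mask_ratio
        visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
        recon = self.decoder(self.encoder(self.embed(visible)))
        # Self-supervised loss: reconstruction error on the masked patches only.
        return ((recon - patches) ** 2)[mask].mean()

loss = TinyMAE()(torch.randn(8, 32, 64))  # a batch of 8 sensor sequences, 32 patches each
```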
arXiv Detail & Related papers (2024-04-03T13:32:44Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.