Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
- URL: http://arxiv.org/abs/2509.23866v1
- Date: Sun, 28 Sep 2025 13:19:20 GMT
- Title: Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
- Authors: Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li,
- Abstract summary: Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL)<n>We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner.<n>On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
- Score: 65.3648667980258
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving the system efficiency: 1.6*GPU utilization for rollout, 1.9* training throughput, and 5.5* environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling for policy mismatch between policy rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
Related papers
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents [23.583947864141162]
EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers.<n>Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training.<n>Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
arXiv Detail & Related papers (2026-01-30T06:01:23Z) - UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning [78.86567400365392]
We present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories.<n>To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation.<n>Experiments show that ours Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks.
arXiv Detail & Related papers (2025-09-15T03:24:08Z) - MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents [36.99267272275733]
We present an online agentic reinforcement learning framework MOBILERL to enhance GUI agents in mobile environments.<n>Its core component is the Difficulty-Adaptive GRPO (ADAGRPO) algorithm.<n>We introduce the shortest path reward adjustment strategy to reshape rewards concerning the task length in multi-turn agentic tasks.<n>The resultant MOBILERL-9B model achieves state-of-the-art results on both AndroidWorld (75.8%) and AndroidLab (46.8%)
arXiv Detail & Related papers (2025-09-10T13:09:27Z) - UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence.<n>We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology.<n> Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z) - ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents [26.75149498779202]
We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully.<n>ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments.<n>We propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs.
arXiv Detail & Related papers (2025-08-19T17:59:45Z) - CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks [11.121687042616974]
Reinforcement Learning (RL) can effectively enhance agents' performance in dynamic interactive GUI environments.<n>Most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates.<n>We propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories.
arXiv Detail & Related papers (2025-08-15T09:55:02Z) - Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards [29.467847768352282]
We introduce Mobile-R1, which employs interactive multi-turn reinforcement learning with task-level rewards for mobile agents.<n>Our training framework consists of three stages: initial format finetuning, single-step online training via action-level reward, followed by online training via task-level reward based on multi-turn trajectories.<n>This strategy is designed to enhance the exploration and error correction capabilities of Mobile-R1, leading to significant performance improvements.
arXiv Detail & Related papers (2025-06-25T11:34:43Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z) - Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.