SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent
- URL: http://arxiv.org/abs/2511.16108v1
- Date: Thu, 20 Nov 2025 07:05:19 GMT
- Title: SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent
- Authors: Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica,
- Abstract summary: We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation.<n>It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability.<n>We train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning.
- Score: 63.15417992240217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
Related papers
- RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning.<n>We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z) - AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework [76.96794548655292]
Large language models (LLMs) have sparked growing interest in building generalist agents that can learn through online interactions.<n>Applying reinforcement learning (RL) to train LLM agents in multi-turn, multi-task settings remains challenging due to lack of scalable infrastructure and stable training algorithms.<n>We present the AgentRL framework for scalable multi-turn, multi-task agentic RL training.
arXiv Detail & Related papers (2025-10-05T13:40:01Z) - Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL)<n>We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner.<n>On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z) - rStar2-Agent: Agentic Reasoning Technical Report [25.266747156205266]
We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance.<n>To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25.
arXiv Detail & Related papers (2025-08-28T12:45:25Z) - How to Train Your LLM Web Agent: A Statistical Diagnosis [96.86317871461834]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training.<n>Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning.<n>Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z) - Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training.<n>Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers.<n>We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA)
arXiv Detail & Related papers (2025-03-24T17:51:39Z) - DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents [38.0441002097771]
DistRL is a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents.<n>On average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods.
arXiv Detail & Related papers (2024-10-18T18:19:56Z) - Contextualized Hybrid Ensemble Q-learning: Learning Fast with Control Priors [5.004576576202551]
We propose a new adaptive hybrid Reinforcement Learning algorithm, Contextualized Hybrid Ensemble Q-learning (CHEQ)
CHEQ combines three key ingredients: (i) a time-invariant formulation of the adaptive hybrid RL problem treating the adaptive weight as a context variable, (ii) a weight adaption mechanism based on the parametric uncertainty of a critic ensemble, and (iii) ensemble-based acceleration for data-efficient RL.
evaluating CHEQ on a car racing task reveals substantially stronger data efficiency, exploration safety, and transferability to unknown scenarios than state-of-the-art adaptive hybrid RL methods.
arXiv Detail & Related papers (2024-06-28T09:17:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.