RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training
- URL: http://arxiv.org/abs/2509.21009v1
- Date: Thu, 25 Sep 2025 11:13:22 GMT
- Title: RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training
- Authors: Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
- Abstract summary: Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). Many RL systems attempt to alleviate GPU underutilization in synchronous RL post-training by relaxing synchronization, but this can compromise training accuracy. We introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds). RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
- Score: 19.00988498482758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, synchronous RL post-training often suffers from significant GPU underutilization, referred to as bubbles, caused by imbalanced response lengths within rollout steps. Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. In this paper, we introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts. By excluding long responses from short rounds and rescheduling them into a few designated long rounds, tail batching effectively reduces GPU idle time during rollouts and significantly accelerates RL training without sacrificing accuracy. We present RollPacker, a system that fully harnesses the benefits of tail batching through holistic optimizations across all three RL stages: elastic parallelism adaptation for rollout, dynamic resource allocation and scheduling for reward, and stream-based training. Empirical results show that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
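The scheduling idea in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not RollPacker's implementation: the function names, the `oversample` parameter, and the use of known response lengths as a stand-in for "which rollouts finish first" are all assumptions made for illustration. It assumes a synchronous step takes as long as its longest response, so a single long response stalls a whole mixed batch.

```python
from typing import Dict, List, Tuple

Round = Tuple[str, List[int]]  # (round kind, prompt ids in that round)


def tail_batching(prompt_ids: List[int], lengths: Dict[int, int],
                  batch_size: int, oversample: int) -> List[Round]:
    """Hypothetical scheduler: keep short rounds short, defer the tail.

    `lengths` is an oracle used only for this simulation; a real system
    would observe which rollouts finish first instead of knowing lengths.
    """
    rounds: List[Round] = []
    long_queue: List[int] = []
    step = batch_size + oversample
    for start in range(0, len(prompt_ids), step):
        launched = prompt_ids[start:start + step]
        # Order prompts by when their responses would finish.
        by_finish = sorted(launched, key=lambda p: lengths[p])
        # Short round: keep only the first `batch_size` finishers.
        rounds.append(("short", by_finish[:batch_size]))
        # Defer the slow prompts so they run together later.
        long_queue.extend(by_finish[batch_size:])
        while len(long_queue) >= batch_size:
            rounds.append(("long", long_queue[:batch_size]))
            long_queue = long_queue[batch_size:]
    if long_queue:  # flush any leftover deferred prompts
        rounds.append(("long", long_queue))
    return rounds


def total_rollout_time(rounds: List[Round], lengths: Dict[int, int]) -> int:
    """A synchronous round lasts as long as its longest response."""
    return sum(max(lengths[p] for p in ids) for _, ids in rounds)


# Toy workload: mostly short responses with a few long-tail ones.
lengths = dict(enumerate([10, 12, 900, 11, 9, 13, 950, 10, 12, 11, 980, 9]))
ids = list(lengths)

# Baseline: fixed batches of 3 in arrival order, long and short mixed.
baseline = [("mixed", ids[i:i + 3]) for i in range(0, len(ids), 3)]
tail = tail_batching(ids, lengths, batch_size=3, oversample=1)

print(total_rollout_time(baseline, lengths))  # 2843
print(total_rollout_time(tail, lengths))      # 1017
```

In this toy workload the three long responses end up in one long round instead of stalling three separate steps, while every prompt is still processed exactly once. The sketch ignores the cost of the extra launched prompts and everything the abstract attributes to the reward and training stages.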
Related papers
- Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter [52.111923076688505]
Training Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. We propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
arXiv Detail & Related papers (2025-11-20T18:59:25Z) - Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning [6.742598086990326]
Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding.
arXiv Detail & Related papers (2025-11-18T16:12:21Z) - Beat the long tail: Distribution-Aware Speculative Decoding for RL Training [75.75462952580796]
We propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves.
arXiv Detail & Related papers (2025-11-17T19:02:12Z) - CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling [11.252930904797]
We propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS). CoPRIS mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. Experiments show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.
arXiv Detail & Related papers (2025-11-05T11:39:32Z) - RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs [48.94639777633359]
We present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources. RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.
arXiv Detail & Related papers (2025-10-22T04:19:37Z) - Laminar: A Scalable Asynchronous RL Post-Training Framework [20.127034898123508]
Long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current RL systems rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture.
arXiv Detail & Related papers (2025-10-14T15:29:14Z) - QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs [80.76334908639745]
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). QeRL addresses issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA). Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase.
arXiv Detail & Related papers (2025-10-13T17:55:09Z) - APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation [40.120847511378365]
Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). We propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. APRIL improves rollout throughput by at most 44% across commonly used RL algorithms.
arXiv Detail & Related papers (2025-09-23T01:32:36Z) - History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL [14.506189610798929]
Reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of large language models (LLMs). We introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy.
arXiv Detail & Related papers (2025-08-26T01:42:46Z) - Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning [23.24949857136035]
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs). We present AReaL, a fully asynchronous RL system that completely decouples generation from training.
arXiv Detail & Related papers (2025-05-30T07:18:25Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA).
arXiv Detail & Related papers (2025-03-24T17:51:39Z) - Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z) - High-Throughput Synchronous Deep RL [132.43861715707905]
We propose High-Throughput Synchronous Deep Reinforcement Learning (HTS-RL).
We perform learning and rollouts concurrently and devise a system design that avoids stale policies.
We evaluate our approach on Atari games and the Google Research Football environment.
arXiv Detail & Related papers (2020-12-17T18:59:01Z)