StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- URL: http://arxiv.org/abs/2504.15930v1
- Date: Tue, 22 Apr 2025 14:19:06 GMT
- Title: StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- Authors: Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang,
- Abstract summary: Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs)<n>StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.<n> Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
- Score: 55.75008325187133
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment. StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.
Related papers
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM)<n>However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing.<n>We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms [4.127488674019288]
Post-training for large language models co-locates trajectory sampling and policy optimisation on the same GPU cluster.<n>We present Echo, the RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms.
arXiv Detail & Related papers (2025-08-07T13:37:04Z) - MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster [6.589537564035392]
Reinforcement learning (RL) is a paradigm increasingly used to align large language models.<n>In this article, we introduce MindSpeed RL, an effective and efficient system for large-scale RL training.
arXiv Detail & Related papers (2025-07-25T07:11:49Z) - AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training [24.60677187852425]
Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs)<n>Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks.<n>Task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance.<n>We propose AsyncFlow, an asynchronous streaming RL framework for efficient post-training.
arXiv Detail & Related papers (2025-07-02T12:45:34Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning [26.103555014247117]
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs)<n>We present AReaL, a fully asynchronous RL system that completely decouples generation from training.
arXiv Detail & Related papers (2025-05-30T07:18:25Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.<n>ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error.<n>ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training.<n>Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers.<n>We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA)
arXiv Detail & Related papers (2025-03-24T17:51:39Z) - The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution [20.926218346718482]
We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution.
We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3--8$times$ compared to traditional batch and stream processing systems.
arXiv Detail & Related papers (2025-01-16T19:54:01Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z) - Efficient Parallel Reinforcement Learning Framework using the Reactor
Model [2.190190313041532]
Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources.
Existing frameworks, such as Ray, are not managing this orchestration efficiently.
We have proposed a solution implementing the reactor model, which enforces a set of actors to have a fixed communication pattern.
arXiv Detail & Related papers (2023-12-07T21:19:57Z) - SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores [13.948640763797776]
We present a novel abstraction on the dataflows of RL training, which unifies diverse RL training applications into a general framework.
We develop a scalable, efficient, and distributed RL system called ReaLly scalableRL, which allows efficient and massively parallelized training.
SRL is the first in the academic community to perform RL experiments at a large scale with over 15k CPU cores.
arXiv Detail & Related papers (2023-06-29T05:16:25Z) - Dual Generator Offline Reinforcement Learning [90.05278061564198]
In offline RL, constraining the learned policy to remain close to the data is essential.
In practice, GAN-based offline RL methods have not performed as well as alternative approaches.
We show that not only does having two generators enable an effective GAN-based offline RL method, but also approximates a support constraint.
arXiv Detail & Related papers (2022-11-02T20:25:18Z) - MSRL: Distributed Reinforcement Learning with Dataflow Fragments [16.867322708270116]
Reinforcement learning (RL) trains many agents, which is resource-intensive and must scale to large GPU clusters.
We describe MindSpore Reinforcement Learning (MSRL), a distributed RL training system that supports distribution policies that govern how RL training is parallelised and distributed on cluster resources.
MSRL introduces the new abstraction of a fragmented dataflow graph, which maps functions from an RL algorithm's training loop to parallel computational fragments.
arXiv Detail & Related papers (2022-10-03T12:34:58Z) - RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem [37.38316954355031]
We re-examine the challenges posed by distributed reinforcement learning.
We show that viewing RL as a dataflow problem leads to highly composable and performant implementations.
We propose RLlib Flow, a hybrid actor-dataflow programming model for distributed RL.
arXiv Detail & Related papers (2020-11-25T13:28:16Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.