AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- URL: http://arxiv.org/abs/2505.24298v2
- Date: Wed, 04 Jun 2025 11:42:19 GMT
- Title: AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Authors: Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu,
- Abstract summary: Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs)<n>We present AReaL, a fully asynchronous RL system that completely decouples generation from training.
- Score: 26.103555014247117
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77$\times$ training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Related papers
- Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms [4.127488674019288]
Post-training for large language models co-locates trajectory sampling and policy optimisation on the same GPU cluster.<n>We present Echo, the RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms.
arXiv Detail & Related papers (2025-08-07T13:37:04Z) - DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation [68.19756761027351]
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models.<n>We investigate their denoising processes and reinforcement learning methods.<n>Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
arXiv Detail & Related papers (2025-06-25T17:35:47Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training [32.575669924032276]
Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs)<n>We present LlamaRL, a fully distributed, asynchronous RL framework optimized for efficient training of large-scale LLMs.
arXiv Detail & Related papers (2025-05-29T22:14:15Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs)<n>StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.<n> Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.<n>We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training.<n>Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers.<n>We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA)
arXiv Detail & Related papers (2025-03-24T17:51:39Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation.<n>Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads.<n>We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single- GPU and multi- GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.<n>Online DPO is found to be most robust to off-policy data.<n>Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
arXiv Detail & Related papers (2024-10-23T19:59:50Z) - OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruction-tuning models leads to an imbalanced computation load across different devices.<n>We rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices.<n>Our method's efficacy and generalizability are further validated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z) - ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation [12.321332446941378]
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for empowering large language model (LLM) applications.<n>We introduce ReaL, a pioneering system for efficient RLHF training.<n>We evaluate ReaL on the LLaMA models with up to 70 billion parameters and 128 GPUs.
arXiv Detail & Related papers (2024-06-20T08:04:07Z) - Spreeze: High-Throughput Parallel Reinforcement Learning Framework [19.3019166138232]
Spreeze is a lightweight parallel framework for reinforcement learning.
It efficiently utilizes a single desktop hardware resource to approach the throughput limit.
It can achieve up to 15,000Hz experience sampling and 370,000Hz network update frame rate.
arXiv Detail & Related papers (2023-12-11T05:25:01Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of LRFs as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.