Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
- URL: http://arxiv.org/abs/2410.18252v1
- Date: Wed, 23 Oct 2024 19:59:50 GMT
- Title: Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
- Authors: Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
- Abstract summary: We propose separating generation and learning in RLHF.
Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost.
- Score: 11.624678008637623
- Abstract: The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
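To make the separation of generation and learning concrete, below is a minimal producer-consumer sketch of the loop the abstract describes. This is an illustration under assumptions, not the authors' released implementation; the callables generate_and_score, train_step, and push_weights are hypothetical placeholders for "sample completions and label them with the reward model", "one optimizer update (e.g. online DPO)", and "copy fresh weights to the generator".

```python
import queue
import threading

def run_async_rlhf(generate_and_score, train_step, push_weights,
                   num_steps, max_pending_batches=4):
    """Sketch of asynchronous RLHF: generation and learning run in separate
    threads, so the learner trains on samples produced by a slightly older
    copy of the policy (online but off-policy)."""
    pending = queue.Queue(maxsize=max_pending_batches)  # bounds how stale (off-policy) the data can get
    done = threading.Event()

    def generator_loop():
        # Producer: keep generating with whatever weights the generator currently holds.
        while not done.is_set():
            batch = generate_and_score()
            while not done.is_set():
                try:
                    pending.put(batch, timeout=1.0)
                    break
                except queue.Full:
                    pass  # learner is behind; wait instead of growing the buffer

    def learner_loop():
        # Consumer: every batch it sees was generated by a previous policy iteration.
        for _ in range(num_steps):
            batch = pending.get()
            train_step(batch)   # e.g. an online DPO update, which the paper found most robust
            push_weights()      # hand the updated weights back to the generator
        done.set()

    threads = [threading.Thread(target=generator_loop),
               threading.Thread(target=learner_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queue is the knob behind the trade-off discussed in the abstract: a larger buffer buys more overlap between generation and learning, but the learner then trains on staler, more off-policy samples.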
Related papers
- MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions [46.608747360764035]
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences.
We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process.
We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
arXiv Detail & Related papers (2024-10-03T17:55:13Z) - Enhancing Spectrum Efficiency in 6G Satellite Networks: A GAIL-Powered Policy Learning via Asynchronous Federated Inverse Reinforcement Learning [67.95280175998792]
A novel generative adversarial imitation learning (GAIL)-powered policy learning approach is proposed for optimizing beamforming, spectrum allocation, and remote user equipment (RUE) association.
We employ inverse RL (IRL) to automatically learn reward functions without manual tuning.
We show that the proposed MA-AL method outperforms traditional RL approaches, achieving a 14.6% improvement in convergence and reward value.
arXiv Detail & Related papers (2024-09-27T13:05:02Z) - OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework [11.745186056668295]
We present OpenRLHF, an open-source framework enabling efficient RLHF scaling.
OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed.
Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts.
arXiv Detail & Related papers (2024-05-20T01:04:40Z) - RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z) - TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning [7.9961739811640244]
Development of Large Language Models often confronts challenges stemming from heavy reliance on human annotators.
In this work, we pivot to Reinforcement Learning -- but with a twist.
We use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning.
arXiv Detail & Related papers (2024-03-13T16:57:57Z) - ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation [43.506732624371786]
We introduce two-stage sampling and dynamic sampling approaches to improve the sampling efficiency during training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
arXiv Detail & Related papers (2023-08-04T09:35:45Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight; its training objective is sketched after this list.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method to solve this benchmark, using unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z) - High-Throughput Synchronous Deep RL [132.43861715707905]
We propose High-Throughput Synchronous Deep Reinforcement Learning (HTS-RL).
We perform learning and rollouts concurrently and devise a system design that avoids stale policies.
We evaluate our approach on Atari games and the Google Research Football environment.
arXiv Detail & Related papers (2020-12-17T18:59:01Z)
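For reference, the DPO entry above describes a closed-form reparameterization of the reward model; the standard form of the resulting objective is the pairwise logistic loss below, where \(\pi_\theta\) is the policy being fine-tuned, \(\pi_{\mathrm{ref}}\) is the reference policy, \(\beta\) is a temperature, and \((x, y_w, y_l)\) is a prompt with preferred and dispreferred completions.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```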