Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
- URL: http://arxiv.org/abs/2410.18252v1
- Date: Wed, 23 Oct 2024 19:59:50 GMT
- Title: Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
- Authors: Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
- Abstract summary: We propose separating generation and learning in RLHF.
Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost.
- Score: 11.624678008637623
- Abstract: The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
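To make the separation of generation and learning concrete, below is a minimal producer-consumer sketch of the loop the abstract describes. This is an illustration under assumptions, not the authors' released implementation; the callables generate_and_score, train_step, and push_weights are hypothetical placeholders for "sample completions and label them with the reward model", "one optimizer update (e.g. online DPO)", and "copy fresh weights to the generator".

```python
import queue
import threading

def run_async_rlhf(generate_and_score, train_step, push_weights,
                   num_steps, max_pending_batches=4):
    """Sketch of asynchronous RLHF: generation and learning run in separate
    threads, so the learner trains on samples produced by a slightly older
    copy of the policy (online but off-policy)."""
    pending = queue.Queue(maxsize=max_pending_batches)  # bounds how stale (off-policy) the data can get
    done = threading.Event()

    def generator_loop():
        # Producer: keep generating with whatever weights the generator currently holds.
        while not done.is_set():
            batch = generate_and_score()
            while not done.is_set():
                try:
                    pending.put(batch, timeout=1.0)
                    break
                except queue.Full:
                    pass  # learner is behind; wait instead of growing the buffer

    def learner_loop():
        # Consumer: every batch it sees was generated by a previous policy iteration.
        for _ in range(num_steps):
            batch = pending.get()
            train_step(batch)   # e.g. an online DPO update, which the paper found most robust
            push_weights()      # hand the updated weights back to the generator
        done.set()

    threads = [threading.Thread(target=generator_loop),
               threading.Thread(target=learner_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queue is the knob behind the trade-off discussed in the abstract: a larger buffer buys more overlap between generation and learning, but the learner then trains on staler, more off-policy samples.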
Related papers
- MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions [46.608747360764035]
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences.
We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process.
We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
arXiv Detail & Related papers (2024-10-03T17:55:13Z) - Enhancing Spectrum Efficiency in 6G Satellite Networks: A GAIL-Powered Policy Learning via Asynchronous Federated Inverse Reinforcement Learning [67.95280175998792]
A novel generative adversarial imitation learning (GAIL)-powered policy learning approach is proposed for optimizing beamforming, spectrum allocation, and remote user equipment (RUE) association.
We employ inverse RL (IRL) to automatically learn reward functions without manual tuning.
We show that the proposed MA-AL method outperforms traditional RL approaches, achieving a 14.6% improvement in convergence and reward value.
arXiv Detail & Related papers (2024-09-27T13:05:02Z) - OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework [11.745186056668295]
We present OpenRLHF, an open-source framework enabling efficient RLHF scaling.
OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed.
Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts.
arXiv Detail & Related papers (2024-05-20T01:04:40Z) - RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z) - TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning [7.9961739811640244]
Development of Large Language Models often confronts challenges stemming from heavy reliance on human annotators.
In this work, we pivot to Reinforcement Learning -- but with a twist.
We use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning.
arXiv Detail & Related papers (2024-03-13T16:57:57Z) - ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation [43.506732624371786]
We introduce two-stage sampling and dynamic sampling approaches to improve the sampling efficiency during training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
arXiv Detail & Related papers (2023-08-04T09:35:45Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight; its training objective is sketched after this list.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method to solve this benchmark, using unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z) - High-Throughput Synchronous Deep RL [132.43861715707905]
We propose High-Throughput Synchronous Deep Reinforcement Learning (HTS-RL).
We perform learning and rollouts concurrently and devise a system design that avoids stale policies.
We evaluate our approach on Atari games and the Google Research Football environment.
arXiv Detail & Related papers (2020-12-17T18:59:01Z)
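For reference, the DPO entry above describes a closed-form reparameterization of the reward model; the standard form of the resulting objective is the pairwise logistic loss below, where \(\pi_\theta\) is the policy being fine-tuned, \(\pi_{\mathrm{ref}}\) is the reference policy, \(\beta\) is a temperature, and \((x, y_w, y_l)\) is a prompt with preferred and dispreferred completions.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```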