OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
- URL: http://arxiv.org/abs/2405.11143v3
- Date: Wed, 17 Jul 2024 09:18:35 GMT
- Title: OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
- Authors: Jian Hu, Xibin Wu, Weixun Wang, Xianyu, Dehao Zhang, Yu Cao,
- Abstract summary: We present OpenRLHF, an open-source framework enabling efficient RLHF scaling.
OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed.
Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts.
- Score: 11.556630218410444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) continue to grow by scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention due to its outstanding performance. However, unlike pretraining or fine-tuning a single model, scaling reinforcement learning from human feedback (RLHF) for training large language models poses coordination challenges across four models. We present OpenRLHF, an open-source framework enabling efficient RLHF scaling. Unlike existing RLHF frameworks that co-locate four models on the same GPUs, OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed, leveraging improved resource utilization and diverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts, which ensures user-friendliness. OpenRLHF implements RLHF, DPO, rejection sampling, and other alignment techniques. Empowering state-of-the-art LLM development, OpenRLHF's code is available at \url{https://github.com/OpenRLHF/OpenRLHF}.
Related papers
- Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.
Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost.
arXiv Detail & Related papers (2024-10-23T19:59:50Z) - How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback)
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z) - The Perfect Blend: Redefining RLHF with Mixture of Judges [68.58426626501883]
Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLM)
Applying RLHF for MTL currently requires careful tuning of the weights for reward model and data combinations.
We introduce a novel post-training paradigm which we called Constrained Generative Policy Optimization (CGPO)
arXiv Detail & Related papers (2024-09-30T15:06:53Z) - RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z) - Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models with human preferences.
To alleviate some of the computational burden of fine-tuning, efficient methods, like LoRA were introduced.
We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering.
arXiv Detail & Related papers (2024-03-15T21:43:46Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - SuperHF: Supervised Iterative Learning from Human Feedback [20.22920163075946]
We focus on two prevalent methods used to align large language models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)
We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods.
Our experimental results show SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our GPT-4 based qualitative evaluation scheme all the while being significantly simpler to implement.
arXiv Detail & Related papers (2023-10-25T16:52:00Z) - Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMM) are built across modalities and misalignment between two modalities can result in "hallucination"
We adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment.
We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information.
Our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z) - RRHF: Rank Responses to Align Language Models with Human Feedback
without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO)
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.