RLHF Workflow: From Reward Modeling to Online RLHF
- URL: http://arxiv.org/abs/2405.07863v2
- Date: Wed, 12 Jun 2024 04:40:53 GMT
- Title: RLHF Workflow: From Reward Modeling to Online RLHF
- Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang,
- Abstract summary: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
- Score: 79.83927049253924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, LLaMA-3-8B-SFR-Iterative-DPO-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
Related papers
- Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.
Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost.
arXiv Detail & Related papers (2024-10-23T19:59:50Z) - How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback)
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z) - SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z) - OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework [11.556630218410444]
We present OpenRLHF, an open-source framework enabling efficient RLHF scaling.
OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed.
Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts.
arXiv Detail & Related papers (2024-05-20T01:04:40Z) - TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning [7.9961739811640244]
Development of Large Language Models often confronts challenges stemming from heavy reliance on human annotators.
In this work, we pivot to Reinforcement Learning -- but with a twist.
We use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning.
arXiv Detail & Related papers (2024-03-13T16:57:57Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMM) are built across modalities and misalignment between two modalities can result in "hallucination"
We adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment.
We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information.
Our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z) - RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback [5.3113139864044046]
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive.
RLAIF offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM.
Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
arXiv Detail & Related papers (2023-09-01T05:53:33Z) - RRHF: Rank Responses to Align Language Models with Human Feedback
without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO)
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.