Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble
- URL: http://arxiv.org/abs/2401.16635v2
- Date: Tue, 21 May 2024 22:21:16 GMT
- Title: Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble
- Authors: Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan,
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
- Score: 67.4269821365504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. As a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. As using an ensemble of large language model-based reward models can be computationally and resource-expensive, we explore efficient ensemble methods including linear-layer ensemble and LoRA-based ensemble. Empirically, we run Best-of-$n$ and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.
Related papers
- Prototypical Reward Network for Data-Efficient RLHF [17.220998116937444]
A reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs)
Our proposed framework Proto-RM leverages prototypical networks to enhance reward models under limited human feedback.
arXiv Detail & Related papers (2024-06-06T15:23:30Z) - MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with
Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Iterative Data Smoothing: Mitigating Reward Overfitting and
Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging the theoretical insights to design improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS)
arXiv Detail & Related papers (2024-01-29T17:43:42Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from
Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this issue, reviewing relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z) - SuperHF: Supervised Iterative Learning from Human Feedback [20.22920163075946]
We focus on two prevalent methods used to align large language models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)
We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods.
Our experimental results show SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our GPT-4 based qualitative evaluation scheme all the while being significantly simpler to implement.
arXiv Detail & Related papers (2023-10-25T16:52:00Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward
Model [126.78737228677025]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - SLiC-HF: Sequence Likelihood Calibration with Human Feedback [35.74135968442311]
We show how the recently introduced Sequence Likelihood (SLiC) can also be used to effectively learn from human preferences.
Experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines.
arXiv Detail & Related papers (2023-05-17T17:57:10Z) - RRHF: Rank Responses to Align Language Models with Human Feedback
without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO)
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.