Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
- URL: http://arxiv.org/abs/2404.01054v3
- Date: Mon, 24 Jun 2024 02:31:06 GMT
- Title: Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
- Authors: Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
- Abstract summary: We propose Regularized Best-of-N (RBoN) to mitigate reward hacking.
RBoN incorporates a proximity term in response selection, similar to preference learning techniques.
Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model trained on a dataset generated with vanilla BoN.
- Score: 7.349727826230864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at decoding time. However, BoN sampling is susceptible to a problem known as reward hacking: because the reward model is an imperfect proxy for the true objective, over-optimizing the proxy reward can degrade performance on the true objective. A common way to prevent reward hacking in preference learning techniques is to optimize the reward under a proximity regularization (e.g., KL regularization) that keeps the language model close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques. We evaluate RBoN on the AlpacaFarm and Anthropic's hh-rlhf datasets and find that it outperforms BoN. As an application of RBoN, we use it to generate a pairwise preference learning dataset. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model trained on a dataset generated with vanilla BoN. Our code is available at https://github.com/CyberAgentAILab/regularized-bon
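For intuition, the selection rule described in the abstract can be sketched in a few lines: score each sampled response with the proxy reward plus a proximity term toward the reference model, then return the highest-scoring candidate. This is a minimal illustrative sketch, not the authors' implementation; the helper interfaces (`reward_fn`, `ref_logprob_fn`) and the specific proximity term (reference log-probability weighted by `beta`) are assumptions made here for clarity.

```python
from typing import Callable, List


def regularized_best_of_n(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],       # proxy reward model (assumed interface)
    ref_logprob_fn: Callable[[str, str], float],  # response log-prob under the reference policy (assumed interface)
    beta: float = 0.1,                            # strength of the proximity regularization (illustrative value)
) -> str:
    """Select a response by proxy reward plus a proximity term (RBoN-style sketch).

    Vanilla BoN would return the argmax of reward_fn alone; the added term
    penalizes responses that the reference model considers unlikely, which is
    the mechanism the abstract describes for mitigating reward hacking.
    """
    def score(response: str) -> float:
        return reward_fn(prompt, response) + beta * ref_logprob_fn(prompt, response)

    return max(candidates, key=score)
```

Setting `beta = 0` recovers vanilla BoN. One plausible way to build the pairwise preference dataset mentioned in the abstract is to pair the response selected above (chosen) with a lower-scoring candidate (rejected) for DPO training; the paper's exact pairing scheme may differ.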
Related papers
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z) - Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss.
OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z) - BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z) - Variational Best-of-N Alignment [58.7977683502207]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences.
We propose to fine-tune the language model to mimic what BoN does during inference.
Our approach is analogous to mean-field variational inference; we thus term it variational BoN (vBoN).
arXiv Detail & Related papers (2024-07-08T15:59:44Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that enables more accurate reward predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - West-of-N: Synthetic Preferences for Self-Improving Reward Models [20.643537269666137]
We present a novel approach to improve reward model quality by generating synthetic preference data.
We find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data.
arXiv Detail & Related papers (2024-01-22T16:24:43Z) - Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)