West-of-N: Synthetic Preference Generation for Improved Reward Modeling
- URL: http://arxiv.org/abs/2401.12086v1
- Date: Mon, 22 Jan 2024 16:24:43 GMT
- Title: West-of-N: Synthetic Preference Generation for Improved Reward Modeling
- Authors: Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause,
Aliaksei Severyn
- Abstract summary: We present a novel approach to improve reward model quality by generating synthetic preference data.
We find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data.
- Score: 20.897381726408838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of reinforcement learning from human feedback (RLHF) in language
model alignment is strongly dependent on the quality of the underlying reward
model. In this paper, we present a novel approach to improve reward model
quality by generating synthetic preference data, thereby augmenting the
training dataset with on-policy, high-quality preference pairs. Motivated by
the promising results of Best-of-N sampling strategies in language model
training, we extend their application to reward model training. This results in
a self-training strategy to generate preference pairs by selecting the best and
worst candidates in a pool of responses to a given query. Empirically, we find
that this approach improves the performance of any reward model, with an effect
comparable to the addition of a similar quantity of human preference data. This
work opens up new avenues of research for improving RLHF for language model
alignment, by offering synthetic preference generation as a solution to reward
modeling challenges.
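As a rough illustration, here is a minimal sketch of this best-versus-worst pair construction, assuming a scored candidate pool; all names (`generate`, `reward_model`) are hypothetical placeholders, not the authors' code:

```python
# Hypothetical sketch of West-of-N synthetic preference generation:
# sample N on-policy responses per query, score them with the current
# reward model, and keep the best/worst pair as a training example.
def west_of_n_pairs(queries, generate, reward_model, n=8):
    """Build synthetic (query, preferred, dispreferred) triples."""
    pairs = []
    for query in queries:
        # Sample an on-policy pool of N candidate responses.
        candidates = [generate(query) for _ in range(n)]
        # Rank candidates by the current reward model's score.
        ranked = sorted(candidates, key=lambda c: reward_model(query, c))
        # Best-of-N vs. Worst-of-N form one synthetic preference pair.
        pairs.append((query, ranked[-1], ranked[0]))
    return pairs
```

The resulting pairs can then be mixed into the reward model's training set, closing the self-training loop the abstract describes.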
Related papers
- Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO)
Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning.
From a learning-to-rank perspective, we train an explicit value model to replicate the behavior of the implicit reward model.
arXiv Detail & Related papers (2024-06-16T09:06:17Z)
- Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [25.011675414622392]
Reward models trained on human preference data have been proven to be effective for aligning Large Language Models with human intent.
However, the generalization capabilities of current reward models to unseen prompts and responses are limited.
Our study proposes a novel approach to enhance the reward model's ability to generalize under distribution shift by regularizing its hidden states.
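One generic way to implement such a hidden-state regularizer (a sketch under assumed details; the paper's specific penalty may differ) is to tie the trained reward model's hidden states to those of a frozen reference model:

```python
import torch

def hidden_state_reg(h_reward, h_base, weight=0.01):
    """Illustrative regularizer: penalize drift of the reward model's
    hidden states (h_reward) from a frozen reference model's (h_base),
    encouraging the reward head to generalize under distribution shift."""
    return weight * torch.mean((h_reward - h_base.detach()) ** 2)
```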
arXiv Detail & Related papers (2024-06-14T17:49:59Z)
- Towards Understanding the Influence of Reward Margin on Preference Model Performance [8.891183078634786]
This study introduces a novel method to estimate the preference differences without the need for detailed, exhaustive labels from human annotators.
Our experimental results provide empirical evidence that incorporating margin values into the training process significantly improves the effectiveness of reward models.
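A common way to fold a margin into the standard Bradley-Terry reward objective looks like the following sketch (a generic formulation, not necessarily this paper's exact loss):

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(r_chosen, r_rejected, margin=1.0):
    """Bradley-Terry loss with a margin: the chosen response must beat
    the rejected one by at least `margin` in reward space."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy usage on random reward scores for a batch of 4 pairs.
loss = margin_ranking_loss(torch.randn(4), torch.randn(4), margin=1.0)
```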
arXiv Detail & Related papers (2024-04-07T12:10:04Z)
- Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
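In the spirit of this idea, a toy sketch of confidence-aware scoring might compare the reward for the true prompt against rewards for contrastive prompts (purely illustrative; TextNorm's actual normalization is more involved):

```python
def confidence_adjusted_reward(reward, image, prompt, contrast_prompts):
    """Illustrative only: trust the reward for (image, prompt) insofar
    as it separates the true prompt from semantically contrastive ones."""
    base = reward(image, prompt)
    contrasts = [reward(image, p) for p in contrast_prompts]  # non-empty list
    return base - max(contrasts)
```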
arXiv Detail & Related papers (2024-04-02T11:40:38Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and codebase for the evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
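The core metric on such trios is pairwise accuracy; a minimal sketch (with `reward_model` as a placeholder callable, not RewardBench's API):

```python
def pairwise_accuracy(trios, reward_model):
    """Fraction of trios where the chosen response outscores the rejected."""
    correct = sum(
        reward_model(prompt, chosen) > reward_model(prompt, rejected)
        for prompt, chosen, rejected in trios
    )
    return correct / len(trios)
```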
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
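A minimal sketch of the ensemble idea, assuming several independently trained reward models (the uncertainty penalty is an illustrative addition, not necessarily the paper's formulation):

```python
import statistics

def ensemble_reward(models, prompt, response, uncertainty_penalty=0.0):
    """Average scores across reward models, optionally discounting
    high-disagreement (uncertain) predictions."""
    scores = [m(prompt, response) for m in models]
    return statistics.mean(scores) - uncertainty_penalty * statistics.pstdev(scores)
```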
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging the theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
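The core update is easy to sketch: after each epoch, soften the hard preference labels toward the model's own predicted win probabilities (the blend rule and `alpha` here are illustrative, not the paper's exact recipe):

```python
def smooth_labels(labels, predicted_probs, alpha=0.9):
    """Blend hard 0/1 preference labels with current model predictions,
    damping overfitting to noisy pairs across training epochs."""
    return [alpha * y + (1 - alpha) * p
            for y, p in zip(labels, predicted_probs)]
```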
arXiv Detail & Related papers (2024-01-29T17:43:42Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
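One generic form such a contrastive term can take (a sketch, not the paper's exact objective) is a hinge penalty that pushes apart representations of chosen and rejected responses to the same prompt:

```python
import torch.nn.functional as F

def contrastive_term(h_chosen, h_rejected, margin=1.0):
    """Penalize chosen/rejected response representations that are not at
    least `margin` apart in cosine distance (distance = 1 - cosine)."""
    cos = F.cosine_similarity(h_chosen, h_rejected, dim=-1)
    return F.relu(cos - (1.0 - margin)).mean()
```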
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [32.752633250862694]
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data.
We introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively.
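The reward-ranked selection step can be sketched as follows (function names are placeholders; fine-tuning on the selected slice then proceeds with a standard supervised loss):

```python
def reward_ranked_subset(samples, reward_model, keep_fraction=0.25):
    """Return the highest-reward (prompt, response) samples for
    supervised fine-tuning."""
    ranked = sorted(samples, key=lambda s: reward_model(*s), reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]
```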
arXiv Detail & Related papers (2023-04-13T18:22:40Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
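A toy sketch of curriculum-style data selection, assuming per-example weights (e.g., domain-similarity scores); the weighting scheme itself is the paper's contribution and is not reproduced here:

```python
import random

def weighted_sample(sentences, weights, k):
    """Sample k monolingual sentences in proportion to their curriculum
    weights for the next back-translation round."""
    return random.choices(sentences, weights=weights, k=k)
```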
arXiv Detail & Related papers (2020-04-07T19:49:58Z)