Scaling Laws for Reward Model Overoptimization
- URL: http://arxiv.org/abs/2210.10760v1
- Date: Wed, 19 Oct 2022 17:56:10 GMT
- Title: Scaling Laws for Reward Model Overoptimization
- Authors: Leo Gao, John Schulman, Jacob Hilton
- Abstract summary: We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
- Score: 19.93331579503503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning from human feedback, it is common to optimize
against a reward model trained to predict human preferences. Because the reward
model is an imperfect proxy, optimizing its value too much can hinder ground
truth performance, in accordance with Goodhart's law. This effect has been
frequently observed, but not carefully measured due to the expense of
collecting human preference data. In this work, we use a synthetic setup in
which a fixed "gold-standard" reward model plays the role of humans, providing
labels used to train a proxy reward model. We study how the gold reward model
score changes as we optimize against the proxy reward model using either
reinforcement learning or best-of-$n$ sampling. We find that this relationship
follows a different functional form depending on the method of optimization,
and that in both cases its coefficients scale smoothly with the number of
reward model parameters. We also study the effect on this relationship of the
size of the reward model dataset, the number of reward model and policy
parameters, and the coefficient of the KL penalty added to the reward in the
reinforcement learning setup. We explore the implications of these empirical
results for theoretical considerations in AI alignment.
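As a concrete illustration of the synthetic setup, the sketch below (a minimal, hypothetical example, not the authors' code) optimizes against a proxy reward model via best-of-$n$ sampling and evaluates the selected completion with the gold reward model; policy_sample, proxy_rm_score, and gold_rm_score are assumed placeholder callables. For best-of-$n$, the KL divergence from the sampling policy has the closed form $\log n - (n-1)/n$, so sweeping $n$ varies the amount of optimization; the paper finds the gold score is well described by $d(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}} d)$ for best-of-$n$ and $d(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)$ for RL, where $d$ is the square root of the KL divergence from the initial policy.
```python
# Hypothetical sketch of the best-of-n measurement loop; not the authors' code.
# policy_sample, proxy_rm_score, and gold_rm_score are assumed stand-ins for a
# language-model sampler, the learned proxy reward model, and the fixed
# "gold-standard" reward model that plays the role of human labelers.
import math
from typing import Callable, List


def best_of_n(prompt: str, n: int,
              policy_sample: Callable[[str], str],
              proxy_rm_score: Callable[[str, str], float]) -> str:
    """Draw n candidates and keep the one the proxy reward model ranks highest."""
    candidates = [policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_rm_score(prompt, c))


def bon_kl(n: int) -> float:
    """Analytic KL divergence of the best-of-n distribution from the policy:
    KL = log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n


def kl_penalized_reward(proxy_reward: float, kl_term: float, beta: float) -> float:
    """Reward used in the RL setup: proxy reward minus beta times the KL penalty
    to the initial policy (beta is the coefficient the paper varies)."""
    return proxy_reward - beta * kl_term


def gold_score_at_n(prompts: List[str], n: int,
                    policy_sample: Callable[[str], str],
                    proxy_rm_score: Callable[[str, str], float],
                    gold_rm_score: Callable[[str, str], float]) -> float:
    """Optimize against the proxy (via best-of-n), evaluate against the gold RM."""
    picks = [best_of_n(p, n, policy_sample, proxy_rm_score) for p in prompts]
    return sum(gold_rm_score(p, c) for p, c in zip(prompts, picks)) / len(prompts)
```
Sweeping $n$ and plotting gold_score_at_n against bon_kl(n) would trace the kind of overoptimization curve described above; the analogous RL experiment would replace best-of-$n$ selection with policy optimization against kl_penalized_reward.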
Related papers
- Towards Reliable Alignment: Uncertainty-aware RLHF [14.20181662644689]
We show that fluctuations in reward models can be detrimental to alignment.
We show that the resulting policies are more risk-averse, in the sense that they are more cautious about uncertain rewards.
We use an ensemble of reward models to align a language model with our methodology and observe that our empirical findings match our theoretical predictions.
arXiv Detail & Related papers (2024-10-31T08:26:51Z)
- How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback).
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
- Evaluating Robustness of Reward Models for Mathematical Reasoning [14.97819343313859]
We introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH.
We demonstrate that the scores on RewardMATH strongly correlate with the results of the optimized policy and effectively estimate reward overoptimization.
arXiv Detail & Related papers (2024-10-02T16:39:58Z)
- Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance.
This work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF.
Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, revealing that many of them are unreliable for both optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate (a minimal aggregation sketch appears at the end of this list).
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
- Collaborative Machine Learning with Incentive-Aware Model Rewards [32.43927226170119]
Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models by training on the aggregated data from many parties.
These parties are only willing to share their data when given enough incentives, such as a guaranteed fair reward based on their contributions.
This paper proposes to determine a party's reward based on the Shapley value and the information gain on model parameters given its data.
arXiv Detail & Related papers (2020-10-24T06:20:55Z)
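As a companion to the "Helping or Herding?" entry above, here is a minimal, hypothetical sketch of reward-model ensemble aggregation; the scorer type and the mean/worst-case aggregation choices are illustrative assumptions, not that paper's exact recipe.
```python
# Hypothetical sketch of reward-model ensembling; the aggregation choices are
# illustrative assumptions, not the exact recipe of any paper listed above.
from statistics import mean
from typing import Callable, List

RewardModel = Callable[[str, str], float]  # (prompt, response) -> scalar reward


def ensemble_reward(prompt: str, response: str,
                    ensemble: List[RewardModel],
                    conservative: bool = True) -> float:
    """Score a response with every ensemble member and aggregate.

    The worst-case (min) aggregate penalizes responses that only some members
    rate highly, while the mean merely averages out noise; as noted above,
    neither removes reward hacking when members share error patterns.
    """
    scores = [rm(prompt, response) for rm in ensemble]
    return min(scores) if conservative else mean(scores)
```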