Scaling Laws for Reward Model Overoptimization
- URL: http://arxiv.org/abs/2210.10760v1
- Date: Wed, 19 Oct 2022 17:56:10 GMT
- Title: Scaling Laws for Reward Model Overoptimization
- Authors: Leo Gao, John Schulman, Jacob Hilton
- Abstract summary: We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
- Score: 19.93331579503503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning from human feedback, it is common to optimize
against a reward model trained to predict human preferences. Because the reward
model is an imperfect proxy, optimizing its value too much can hinder ground
truth performance, in accordance with Goodhart's law. This effect has been
frequently observed, but not carefully measured due to the expense of
collecting human preference data. In this work, we use a synthetic setup in
which a fixed "gold-standard" reward model plays the role of humans, providing
labels used to train a proxy reward model. We study how the gold reward model
score changes as we optimize against the proxy reward model using either
reinforcement learning or best-of-$n$ sampling. We find that this relationship
follows a different functional form depending on the method of optimization,
and that in both cases its coefficients scale smoothly with the number of
reward model parameters. We also study the effect on this relationship of the
size of the reward model dataset, the number of reward model and policy
parameters, and the coefficient of the KL penalty added to the reward in the
reinforcement learning setup. We explore the implications of these empirical
results for theoretical considerations in AI alignment.
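The abstract names best-of-$n$ sampling as one of the two ways of optimizing against the proxy reward model. Below is a minimal illustrative sketch (not the authors' code) of best-of-$n$ selection under that synthetic setup; the `sample` and `proxy_reward` callables are hypothetical stand-ins for the policy and the trained proxy reward model.

```python
import math
from typing import Callable, List


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              proxy_reward: Callable[[str, str], float],
              n: int = 16) -> str:
    """Draw n completions from the unoptimized policy and return the one
    the proxy reward model scores highest (best-of-n sampling)."""
    completions: List[str] = [sample(prompt) for _ in range(n)]
    scores = [proxy_reward(prompt, c) for c in completions]
    return completions[max(range(n), key=scores.__getitem__)]


def bon_kl_nats(n: int) -> float:
    """Analytic KL divergence of the best-of-n policy from the base policy,
    KL = log n - (n - 1) / n (in nats), a standard closed form for
    best-of-n selection."""
    return math.log(n) - (n - 1) / n
```

In the setup described above, one would additionally score the selected completion with the fixed "gold-standard" reward model and track how the mean gold score changes as $n$ (and hence the KL distance from the initial policy) grows, which is how overoptimization shows up as a gap between proxy and gold scores.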