Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
- URL: http://arxiv.org/abs/2407.14503v1
- Date: Fri, 19 Jul 2024 17:57:59 GMT
- Title: Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
- Authors: Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
- Abstract summary: We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility.
If error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model.
The pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error.
- Score: 1.0582505915332336
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model--a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
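As a rough illustration of the light- versus heavy-tailed dichotomy, here is a hypothetical Monte Carlo sketch (not code from the paper): best-of-n selection stands in for optimization under a KL budget of roughly log n nats, the proxy reward is true utility plus independent error, and the error is drawn from either a Gaussian (light-tailed) or a Student-t (heavy-tailed) distribution.

```python
# Hypothetical sketch: proxy reward = true utility + error; best-of-n selection
# approximates optimization against the proxy under a KL budget of ~log(n) nats.
import numpy as np

rng = np.random.default_rng(0)

def utility_of_best_of_n(error_sampler, n, trials=2000):
    """Mean true utility of the sample that maximizes proxy reward among n draws."""
    utility = rng.standard_normal((trials, n))        # true utility: light-tailed
    proxy = utility + error_sampler((trials, n))      # misspecified proxy reward
    best = np.argmax(proxy, axis=1)
    return utility[np.arange(trials), best].mean()

light = lambda size: rng.standard_normal(size)        # Gaussian error (light tails)
heavy = lambda size: rng.standard_t(df=2, size=size)  # Student-t, df=2 (heavy tails)

for n in (1, 10, 100, 1000):
    print(f"n={n:4d}  light-tailed error: {utility_of_best_of_n(light, n):+.3f}  "
          f"heavy-tailed error: {utility_of_best_of_n(heavy, n):+.3f}")
```

With light-tailed error the selected samples gain real utility as n grows; with heavy-tailed error the top-scoring sample is increasingly one selected for an extreme error draw rather than high utility, so measured utility lags far behind even as proxy reward climbs.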
Related papers
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
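As a minimal sketch of the weight-space merging WARP builds on (the method itself combines exponential moving averages, spherical interpolation of task vectors, and interpolation toward the initialization, none of which is shown here), merging same-architecture checkpoints can be as simple as averaging their parameter tensors:

```python
# Illustrative linear weight-space merge of same-architecture policy checkpoints.
import numpy as np

def merge_policies(state_dicts, weights=None):
    """Weighted average of parameter tensors across policy checkpoints."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
            for name in state_dicts[0]}

# Two toy "policies" with a single shared parameter tensor.
policy_a = {"layer.weight": np.array([[1.0, 2.0], [3.0, 4.0]])}
policy_b = {"layer.weight": np.array([[3.0, 2.0], [1.0, 0.0]])}
print(merge_policies([policy_a, policy_b])["layer.weight"])  # [[2. 2.] [2. 2.]]
```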
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret [64.04721528586747]
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging.
In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error there exist realistic data distributions that allow for high worst-case regret.
We then show that similar problems persist even when using policy regularization techniques commonly employed in methods such as RLHF.
arXiv Detail & Related papers (2024-06-22T06:43:51Z)
- Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [50.808123629394245]
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs).
This work formulates and formalizes the reward over-optimization or hacking problem for Direct Alignment Algorithms (DAAs).
We find that DAA methods deteriorate not only across a wide range of KL budgets but also often before even a single epoch of the dataset is completed.
arXiv Detail & Related papers (2024-06-05T03:41:37Z)
- Preventing Reward Hacking with Occupancy Measure Regularization [13.02511938180832]
Reward hacking occurs when an agent performs well according to a misspecified proxy reward but poorly with respect to the unknown true reward.
We propose regularizing based on the occupancy measure (OM) divergence between policies, rather than action distribution (AD) divergence, to prevent reward hacking.
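To make the OM-versus-AD distinction concrete, here is a hypothetical tabular example (not the paper's implementation): two policies that differ by only 5% in any state's action distribution can induce very different discounted state-visitation (occupancy) distributions when that small difference funnels probability into an absorbing, reward-hacked state.

```python
# Tiny Markov chain: state 0 is nominal, state 1 is an absorbing "hacked" state.
import numpy as np

P = np.zeros((2, 2, 2))  # P[s, a, s']: transition probabilities
P[0, 0, 0] = 1.0         # action 0 in state 0: stay nominal
P[0, 1, 1] = 1.0         # action 1 in state 0: jump to the hacked state
P[1, :, 1] = 1.0         # hacked state is absorbing under every action

def occupancy(policy, P, mu0, gamma=0.99):
    """Discounted state-visitation distribution of a stationary policy."""
    P_pi = np.einsum("sa,sat->st", policy, P)  # state transition kernel under pi
    return (1 - gamma) * np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi.T, mu0)

base    = np.array([[1.00, 0.00], [0.5, 0.5]])  # never takes the hacking action
tweaked = np.array([[0.95, 0.05], [0.5, 0.5]])  # takes it 5% of the time
mu0 = np.array([1.0, 0.0])                      # start in the nominal state

ad_gap = 0.5 * np.abs(base - tweaked).sum(axis=1).max()  # max per-state TV distance
om_gap = 0.5 * np.abs(occupancy(base, P, mu0) - occupancy(tweaked, P, mu0)).sum()
print(f"AD divergence (max TV): {ad_gap:.3f}   OM divergence (TV): {om_gap:.3f}")
# AD divergence (max TV): 0.050   OM divergence (TV): ~0.83
```

A per-step action-distribution penalty barely registers the change, while the occupancy-measure divergence flags it.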
arXiv Detail & Related papers (2024-03-05T18:22:15Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles [26.955375398765085]
Reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs).
In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization.
We propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning.
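A hedged sketch of the penalty itself (the paper derives it from a diverse LoRA-ensemble reward model; here the ensemble is just an array of scores): subtract a multiple of ensemble disagreement from the mean reward, so responses that the reward heads disagree about are down-weighted during RL fine-tuning.

```python
# Illustrative uncertainty-penalized reward: mean ensemble score minus a
# disagreement penalty (a stand-in for the paper's LoRA-ensemble formulation).
import numpy as np

def penalized_reward(ensemble_scores, beta=1.0):
    """ensemble_scores: (num_heads, num_responses) -> one penalized reward per response."""
    scores = np.asarray(ensemble_scores, dtype=float)
    return scores.mean(axis=0) - beta * scores.std(axis=0)

# Three reward heads scoring two responses: the second has the higher mean, but
# the heads disagree sharply on it, so it is penalized below the first.
scores = [[1.0, 2.0],
          [1.1, 0.1],
          [0.9, 3.9]]
print(penalized_reward(scores, beta=1.0))  # approx. [0.92, 0.45]
```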
arXiv Detail & Related papers (2023-12-30T14:14:14Z)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [63.666119126351965]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
- Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
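A schematic of the best-of-n side of that setup (`generate` and `proxy_reward` below are hypothetical stand-ins, not the paper's code): sample n candidates, score each with the proxy reward model, and keep the argmax; the induced policy sits at an analytically known KL distance of log(n) - (n-1)/n nats from the sampling policy, which lets best-of-n and RL be compared on a common axis.

```python
# Schematic best-of-n sampling against a proxy reward model.
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(prompt, generate, proxy_reward, n):
    """Return the candidate that the proxy reward model ranks highest among n samples.
    KL(best-of-n policy || sampling policy) = log(n) - (n - 1) / n nats."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [proxy_reward(prompt, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage: "generations" are scalars and the proxy is a noisy view of a gold score.
gold = lambda prompt, c: c
proxy = lambda prompt, c: c + 2.0 * rng.standard_normal()
pick = best_of_n("prompt", lambda prompt: rng.standard_normal(), proxy, n=16)
print(f"gold score of the selected sample: {gold('prompt', pick):+.3f}")
```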
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
- RL with KL penalties is better viewed as Bayesian inference [4.473139775790299]
We analyze challenges associated with treating a language model as a reinforcement learning policy.
We show how avoiding those challenges requires moving beyond the RL paradigm.
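The identity behind that reframing is standard and worth stating for context (with base model \pi_0, reward r, and KL coefficient \beta): the KL-regularized objective is maximized in closed form by an exponential tilting of the base model, which is exactly a Bayesian posterior with prior \pi_0 and likelihood proportional to \exp(r/\beta).

$$
\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_0(y \mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right),
\qquad
Z(x) \;=\; \sum_{y} \pi_0(y \mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right)
$$

Loosening the KL penalty lowers the temperature \beta and concentrates this posterior on high-reward outputs, the regime where, per the abstract above, the tail behavior of the reward error determines whether genuine utility or pure error gets amplified.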
arXiv Detail & Related papers (2022-05-23T12:47:13Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.