Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
- URL: http://arxiv.org/abs/2407.14503v1
- Date: Fri, 19 Jul 2024 17:57:59 GMT
- Title: Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
- Authors: Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
- Abstract summary: We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility.
If error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model.
The pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error.
- Score: 1.0582505915332336
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model--a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
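The contrast between the two regimes is easy to reproduce in a toy simulation (a minimal sketch, not the paper's construction): model the learned reward as true utility plus an independent error term, and use best-of-n selection from the base model as a stand-in for a KL-constrained policy, since the KL of best-of-n from the base distribution is roughly log n.

```python
# Minimal sketch (not the paper's construction): best-of-n selection against a
# misspecified reward = true utility + error, with light- vs heavy-tailed error.
import numpy as np

rng = np.random.default_rng(0)

def mean_utility_of_best(error_sampler, n=10_000, trials=200):
    """Average true utility of the candidate that maximizes the proxy reward."""
    picked = []
    for _ in range(trials):
        utility = rng.normal(size=n)            # true utility of each candidate
        proxy = utility + error_sampler(n)      # reward-model score with error
        picked.append(utility[np.argmax(proxy)])
    return float(np.mean(picked))

light_tailed = lambda n: rng.normal(size=n)       # Gaussian error
heavy_tailed = lambda n: rng.pareto(1.5, size=n)  # Pareto error, infinite variance

print("light-tailed error:", mean_utility_of_best(light_tailed))  # clearly > 0
print("heavy-tailed error:", mean_utility_of_best(heavy_tailed))  # near 0
# With Gaussian error the top-scoring candidate still has high utility; with
# Pareto error the top score is usually an error outlier, so utility stays near
# the base-model average of 0 even though the proxy reward is enormous.
```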
Related papers
- Sharp Analysis for KL-Regularized Contextual Bandits and RLHF [52.519416266840814]
Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique used to enhance policy optimization in reinforcement learning.
We show that a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient.
Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.
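For reference, the objective analyzed here is the standard reverse-KL-regularized bandit/RLHF objective, whose per-context optimum has a well-known closed form (a standard identity, not a result specific to this paper):

```latex
% KL-regularized objective and its closed-form optimum (standard result)
\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid x)}\!\left[ r(x,a) \right]
  - \eta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\quad\Longrightarrow\quad
\pi^{*}(a \mid x) \;\propto\; \pi_{\mathrm{ref}}(a \mid x)\, \exp\!\left( r(x,a) / \eta \right)
```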
arXiv Detail & Related papers (2024-11-07T11:22:46Z)
- The Perfect Blend: Redefining RLHF with Mixture of Judges [68.58426626501883]
Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLMs).
Applying RLHF for multi-task learning (MTL) currently requires careful tuning of the weights for reward-model and data combinations.
We introduce a novel post-training paradigm which we call Constrained Generative Policy Optimization (CGPO).
arXiv Detail & Related papers (2024-09-30T15:06:53Z)
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
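A generic version of the idea can be sketched as follows (a hypothetical toy, not one of the paper's two algorithms): compare a positively and a negatively perturbed policy via simulated pairwise preferences, invert a Bradley-Terry link to estimate the value difference, and plug it into a two-point zeroth-order gradient estimate.

```python
# Hypothetical sketch of a preference-driven zeroth-order step (not one of the
# paper's two algorithms). Assumes Bradley-Terry preferences:
# P(a preferred to b) = sigmoid(V(a) - V(b)).
import numpy as np

rng = np.random.default_rng(0)

def true_value(theta):
    # Black-box objective the simulated "human" implicitly compares against.
    return -np.sum((theta - 1.0) ** 2)

def preference_rate(theta_a, theta_b, queries=500):
    """Fraction of simulated preference queries that favor theta_a."""
    p = 1.0 / (1.0 + np.exp(-(true_value(theta_a) - true_value(theta_b))))
    return rng.binomial(queries, p) / queries

def zeroth_order_step(theta, delta=0.1, lr=0.2):
    u = rng.normal(size=theta.shape)
    u /= np.linalg.norm(u)                               # random unit direction
    rate = np.clip(preference_rate(theta + delta * u, theta - delta * u), 1e-3, 1 - 1e-3)
    value_diff = np.log(rate / (1.0 - rate))              # invert the Bradley-Terry link
    return theta + lr * (value_diff / (2.0 * delta)) * u  # two-point estimate

theta = np.zeros(3)
for _ in range(300):
    theta = zeroth_order_step(theta)
print("theta after 300 steps:", theta)  # should drift toward the optimum [1, 1, 1]
```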
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
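Stripped of the schedule around it, the core weight-space merge is a simple parameter-wise combination (a minimal sketch of that basic operation, not WARP's full three-stage procedure):

```python
# Minimal sketch of weight-space policy merging (the basic operation only, not
# WARP's full three-stage procedure).
import numpy as np

Params = dict[str, np.ndarray]

def merge_weights(policies: list[Params], coeffs: list[float]) -> Params:
    """Convex combination of several policies' parameters, key by key."""
    assert abs(sum(coeffs) - 1.0) < 1e-8, "coefficients should sum to 1"
    return {name: sum(c * p[name] for c, p in zip(coeffs, policies))
            for name in policies[0]}

# Toy usage: average two fine-tuned policies equally.
policy_a = {"layer.weight": np.ones((2, 2)), "layer.bias": np.zeros(2)}
policy_b = {"layer.weight": 3.0 * np.ones((2, 2)), "layer.bias": np.ones(2)}
print(merge_weights([policy_a, policy_b], [0.5, 0.5])["layer.weight"])  # all 2.0
```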
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret [64.04721528586747]
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging.
In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but also that for any fixed expected test error there exist realistic data distributions under which high regret can still occur.
We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF.
arXiv Detail & Related papers (2024-06-22T06:43:51Z)
- Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [50.808123629394245]
Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline.
This work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
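For context, the canonical DAA is Direct Preference Optimization, which fits an implicit reward through the policy itself; its loss is (standard formulation, included here for reference):

```latex
% Direct Preference Optimization objective (standard formulation)
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```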
arXiv Detail & Related papers (2024-06-05T03:41:37Z)
- Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking [11.589217788048964]
We introduce a definition of reward hacking based on the correlation between proxy and true rewards for states.
We show theoretically that regularization to the base policy can effectively prevent reward hacking.
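A toy version of such a correlation check might look like this (hypothetical helper, not the paper's formal definition), estimating the correlation between proxy and true rewards over a batch of sampled states:

```python
# Toy correlation-based reward-hacking check (hypothetical helper, not the
# paper's exact definition).
import numpy as np

def proxy_true_correlation(states, proxy_reward, true_reward):
    """Pearson correlation of proxy vs. true reward over sampled states."""
    proxy = np.array([proxy_reward(s) for s in states])
    true = np.array([true_reward(s) for s in states])
    return np.corrcoef(proxy, true)[0, 1]

# Example: a proxy that tracks the true reward except on rare "hackable" states.
rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(5000, 2))
true_r = lambda s: -np.sum(s ** 2)
proxy_r = lambda s: -np.sum(s ** 2) + (10.0 if s[0] > 0.99 else 0.0)
print("correlation:", proxy_true_correlation(states, proxy_r, true_r))
```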
arXiv Detail & Related papers (2024-03-05T18:22:15Z)
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles [26.955375398765085]
Reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs).
In this paper, we identify a weakness of the KL regularization commonly employed in existing RLHF methods to address overoptimization.
We propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning.
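The penalization itself is simple to state (a toy NumPy sketch of the general mean-minus-uncertainty idea; the paper's diverse LoRA reward ensembles are not modeled here):

```python
# Minimal sketch of uncertainty-penalized rewards from an ensemble (general
# mean-minus-uncertainty idea; the paper's LoRA-based ensembles are not modeled).
import numpy as np

def penalized_reward(ensemble_scores: np.ndarray, penalty: float = 1.0) -> np.ndarray:
    """ensemble_scores has shape (n_models, n_samples); returns mean - penalty * std."""
    mean = ensemble_scores.mean(axis=0)
    std = ensemble_scores.std(axis=0)
    return mean - penalty * std   # downweight rewards the ensemble disagrees on

# Example: three reward models scoring four candidate responses.
scores = np.array([[1.0, 2.0, 0.5, 3.0],
                   [1.1, 1.9, 0.4, 0.5],
                   [0.9, 2.1, 0.6, 5.5]])
print(penalized_reward(scores, penalty=1.0))
# The last response has the highest mean but high disagreement, so after the
# penalty it drops below the more consistent second response.
```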
arXiv Detail & Related papers (2023-12-30T14:14:14Z)
- Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
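A toy version of the best-of-n measurement (not the paper's experimental setup) records both the proxy score and the gold score of the selected sample; the gap between the two widens as n grows, which is the overoptimization effect the scaling laws quantify. For best-of-n, the KL from the base distribution is log(n) - (n-1)/n.

```python
# Toy best-of-n overoptimization measurement (a sketch, not the paper's setup):
# optimize candidates against a noisy proxy, then record the proxy score and the
# gold score of the selected sample. KL(best-of-n || base) = log(n) - (n-1)/n.
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_scores(n, noise_scale=2.0, trials=2000):
    gold = rng.normal(size=(trials, n))                        # gold reward
    proxy = gold + noise_scale * rng.normal(size=(trials, n))  # imperfect proxy
    idx = np.argmax(proxy, axis=1)
    rows = np.arange(trials)
    return proxy[rows, idx].mean(), gold[rows, idx].mean()

for n in (1, 4, 16, 64, 256):
    kl = np.log(n) - (n - 1) / n
    p, g = best_of_n_scores(n)
    print(f"n={n:4d}  KL={kl:5.2f}  proxy={p:5.2f}  gold={g:5.2f}")
# The proxy score of the selected sample grows much faster than its gold score.
```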
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)