Preventing Reward Hacking with Occupancy Measure Regularization
- URL: http://arxiv.org/abs/2403.03185v1
- Date: Tue, 5 Mar 2024 18:22:15 GMT
- Title: Preventing Reward Hacking with Occupancy Measure Regularization
- Authors: Cassidy Laidlaw, Shivam Singhal, Anca Dragan
- Abstract summary: Reward hacking occurs when an agent performs well with respect to a "proxy" reward function but poorly with respect to the unknown true reward.
We propose regularizing based on the state occupancy measure (OM) divergence between policies, instead of the action distribution (AD) divergence, to prevent reward hacking.
- Score: 13.02511938180832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward hacking occurs when an agent performs very well with respect to a
"proxy" reward function (which may be hand-specified or learned), but poorly
with respect to the unknown true reward. Since ensuring good alignment between
the proxy and true reward is extremely difficult, one approach to prevent
reward hacking is optimizing the proxy conservatively. Prior work has
particularly focused on constraining the learned policy to behave similarly to a
"safe" policy by penalizing the KL divergence between their action
distributions (AD). However, AD regularization doesn't always work well since a
small change in action distribution at a single state can lead to potentially
calamitous outcomes, while large changes might not be indicative of any
dangerous activity. Our insight is that when reward hacking, the agent visits
drastically different states from those reached by the safe policy, causing
large deviations in state occupancy measure (OM). Thus, we propose regularizing
based on the OM divergence between policies instead of AD divergence to prevent
reward hacking. We theoretically establish that OM regularization can more
effectively avoid large drops in true reward. Then, we empirically demonstrate
in a variety of realistic environments that OM divergence is superior to AD
divergence for preventing reward hacking by regularizing towards a safe policy.
Furthermore, we show that occupancy measure divergence can also regularize
learned policies away from reward hacking behavior. Our code and data are
available at https://github.com/cassidylaidlaw/orpo
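A minimal tabular sketch of the idea: estimate each policy's discounted state occupancy measure from sampled rollouts, then penalize the divergence between the learned policy's measure and the safe policy's measure instead of a per-state KL between action distributions. The tabular setting, function names, and the use of total variation distance are illustrative assumptions; the released ORPO code linked above may estimate and penalize OM divergence differently.

```python
import numpy as np

def empirical_occupancy(rollouts, n_states, gamma=0.99):
    """Estimate a discounted state occupancy measure from sampled rollouts.

    rollouts: list of state-index sequences generated by one policy.
    Returns a length-n_states probability vector.
    """
    d = np.zeros(n_states)
    for states in rollouts:
        for t, s in enumerate(states):
            d[s] += gamma ** t
    return d / d.sum()

def tv_distance(p, q):
    """Total variation distance between two occupancy measures."""
    return 0.5 * np.abs(p - q).sum()

def om_regularized_objective(proxy_return, d_learned, d_safe, lam=1.0):
    """Proxy return penalized by occupancy-measure divergence to the safe policy,
    in place of the usual per-state action-distribution KL penalty."""
    return proxy_return - lam * tv_distance(d_learned, d_safe)
```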
Related papers
- Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification [1.0582505915332336]
We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility.
If error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model.
The pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error.
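A toy simulation (not from the paper) of the qualitative claim, assuming Gaussian versus Cauchy proxy error: when the error is heavy-tailed, aggressively optimizing the proxy selects an enormous error draw rather than genuine utility.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # candidate policies being compared by proxy reward

true_utility = rng.normal(size=n)           # unknown true reward of each candidate
light_error = rng.normal(size=n)            # light-tailed reward misspecification
heavy_error = rng.standard_cauchy(size=n)   # heavy-tailed reward misspecification

for name, err in [("light-tailed", light_error), ("heavy-tailed", heavy_error)]:
    proxy = true_utility + err
    best = int(np.argmax(proxy))            # "optimize the proxy hard"
    print(f"{name}: proxy reward {proxy[best]:.1f}, true utility {true_utility[best]:.2f}")
```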
arXiv Detail & Related papers (2024-07-19T17:57:59Z)
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP)
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
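A minimal sketch of the weight-space merging ingredient, assuming the policies share an architecture and floating-point parameters; WARP itself applies merging at three distinct stages with specific interpolation schemes, so this plain linear average is only illustrative.

```python
import torch

def merge_policies(state_dicts, coeffs=None):
    """Linearly average several policies' parameters in weight space.

    state_dicts: list of state_dicts from policies with identical architectures.
    coeffs: optional mixing coefficients (defaults to a uniform average).
    """
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
    return merged

# usage (hypothetical): policy.load_state_dict(merge_policies([sd_a, sd_b, sd_c]))
```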
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- Safety Margins for Reinforcement Learning [74.13100479426424]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
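One plausible proxy criticality metric, sketched below as an assumption rather than the paper's exact definition, is the spread of the agent's Q-values at a state: if every action scores similarly, deviating from the policy is cheap, whereas a large spread marks a critical state that warrants a wide safety margin.

```python
import numpy as np

def proxy_criticality(q_values):
    """Gap between the best and the average Q-value at a state.

    A small gap suggests deviations from the policy are cheap; a large gap
    suggests the state is critical. This particular metric is illustrative,
    not necessarily the one used in the paper.
    """
    q = np.asarray(q_values, dtype=float)
    return float(q.max() - q.mean())
```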
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
- Provable Defense against Backdoor Policies in Reinforcement Learning [35.908468039596734]
A backdoor policy is a security threat where an adversary publishes a seemingly well-behaved policy that in fact contains hidden triggers the adversary can later activate.
We propose a provable defense mechanism against backdoor policies in reinforcement learning under subspace trigger assumption.
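A rough sketch of one sanitization strategy consistent with the subspace trigger assumption: project each observation onto a low-rank subspace estimated from clean, trigger-free states before handing it to the possibly-backdoored policy. The SVD-based construction and helper names are assumptions here; the paper's actual procedure and guarantees may differ in detail.

```python
import numpy as np

def fit_clean_subspace(clean_states, k):
    """Estimate a rank-k subspace spanned by clean (trigger-free) observations."""
    X = np.asarray(clean_states, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]          # basis rows span the clean subspace

def sanitize(observation, mean, basis):
    """Project an observation onto the clean subspace, removing out-of-subspace triggers."""
    centered = np.asarray(observation, dtype=float) - mean
    return mean + basis.T @ (basis @ centered)
```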
arXiv Detail & Related papers (2022-11-18T23:12:24Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
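A generic contrastive (InfoNCE-style) lower bound on the mutual information between states and actions, shown only to make the regularized quantity concrete; MISA's actual bound and the way it enters policy and critic learning are not reproduced here.

```python
import math
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(state_emb, action_emb):
    """Contrastive lower bound on I(S; A) for a batch of paired (s, a) embeddings.

    state_emb, action_emb: [batch, dim] tensors; matched rows are positive pairs.
    Returns log(batch) minus the cross-entropy of picking the matching action
    embedding for each state, a standard InfoNCE bound on mutual information.
    """
    logits = state_emb @ action_emb.t()          # pairwise similarity scores
    labels = torch.arange(logits.size(0))        # positives lie on the diagonal
    return math.log(logits.size(0)) - F.cross_entropy(logits, labels)
```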
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Defining and Characterizing Reward Hacking [3.385988109683852]
We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return.
In particular, for the set of all policies, two reward functions can only be unhackable if one of them is constant.
Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
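The quoted definition is straightforward to check on a finite set of policies, as in the sketch below (enumerating policy pairs is an illustrative simplification; the paper's results concern the set of all policies).

```python
from itertools import combinations

def is_unhackable(proxy_returns, true_returns):
    """Check unhackability over a finite policy set: increasing the expected
    proxy return must never strictly decrease the expected true return.

    proxy_returns[i], true_returns[i]: expected returns of policy i.
    """
    for i, j in combinations(range(len(proxy_returns)), 2):
        if proxy_returns[i] > proxy_returns[j] and true_returns[i] < true_returns[j]:
            return False
        if proxy_returns[j] > proxy_returns[i] and true_returns[j] < true_returns[i]:
            return False
    return True
```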
arXiv Detail & Related papers (2022-09-27T00:32:44Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
- Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences [33.471102483095315]
We investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values.
We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy.
No significant differences were observed in the discrete-action setting or on a suite of benchmark problems.
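A minimal sketch of the two greedification losses for a discrete-action policy, assuming policy logits and a vector of action values are available; the estimators and training loop from the paper are omitted.

```python
import torch
import torch.nn.functional as F

def forward_kl(policy_logits, q_values, tau=1.0):
    """KL(Boltzmann(Q) || pi): the mode-covering greedification loss."""
    target = F.softmax(q_values / tau, dim=-1)
    log_target = F.log_softmax(q_values / tau, dim=-1)
    log_pi = F.log_softmax(policy_logits, dim=-1)
    return (target * (log_target - log_pi)).sum(dim=-1).mean()

def reverse_kl(policy_logits, q_values, tau=1.0):
    """KL(pi || Boltzmann(Q)): the mode-seeking greedification loss."""
    pi = F.softmax(policy_logits, dim=-1)
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_target = F.log_softmax(q_values / tau, dim=-1)
    return (pi * (log_pi - log_target)).sum(dim=-1).mean()
```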
arXiv Detail & Related papers (2021-07-17T17:09:18Z)
- Adaptive Discretization for Adversarial Lipschitz Bandits [85.39106976861702]
Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces.
A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions.
We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds.
arXiv Detail & Related papers (2020-06-22T16:06:25Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
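One simple way to realize a state-action-dependent allowable deviation, sketched here as an assumption rather than BRPO's exact parameterization: shift the behavior policy's log-probabilities by a learned residual whose magnitude is bounded per state-action pair, then renormalize.

```python
import torch
import torch.nn.functional as F

def residual_policy(behavior_probs, residual_logits, allowed_dev):
    """Perturb the behavior policy by a bounded, state-action-dependent residual.

    behavior_probs:  [batch, n_actions] probabilities of the data-generating policy
    residual_logits: [batch, n_actions] learned residual scores
    allowed_dev:     [batch, n_actions] learned non-negative deviation bounds
    (This parameterization is illustrative, not BRPO's exact form.)
    """
    delta = allowed_dev * torch.tanh(residual_logits)   # bounded log-probability shift
    logits = torch.log(behavior_probs + 1e-8) + delta
    return F.softmax(logits, dim=-1)
```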
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.