Preventing Reward Hacking with Occupancy Measure Regularization
- URL: http://arxiv.org/abs/2403.03185v1
- Date: Tue, 5 Mar 2024 18:22:15 GMT
- Title: Preventing Reward Hacking with Occupancy Measure Regularization
- Authors: Cassidy Laidlaw, Shivam Singhal, Anca Dragan
- Abstract summary: Reward hacking occurs when an agent performs well with respect to a "proxy" reward function but poorly with respect to the unknown true reward.
We propose regularizing based on the state occupancy measure (OM) divergence between policies, instead of the action distribution (AD) divergence, to prevent reward hacking.
- Score: 13.02511938180832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward hacking occurs when an agent performs very well with respect to a
"proxy" reward function (which may be hand-specified or learned), but poorly
with respect to the unknown true reward. Since ensuring good alignment between
the proxy and true reward is extremely difficult, one approach to prevent
reward hacking is optimizing the proxy conservatively. Prior work has
particularly focused on constraining the learned policy to behave similarly to a
"safe" policy by penalizing the KL divergence between their action
distributions (AD). However, AD regularization doesn't always work well since a
small change in action distribution at a single state can lead to potentially
calamitous outcomes, while large changes might not be indicative of any
dangerous activity. Our insight is that when reward hacking, the agent visits
drastically different states from those reached by the safe policy, causing
large deviations in state occupancy measure (OM). Thus, we propose regularizing
based on the OM divergence between policies instead of AD divergence to prevent
reward hacking. We theoretically establish that OM regularization can more
effectively avoid large drops in true reward. Then, we empirically demonstrate
in a variety of realistic environments that OM divergence is superior to AD
divergence for preventing reward hacking by regularizing towards a safe policy.
Furthermore, we show that occupancy measure divergence can also regularize
learned policies away from reward hacking behavior. Our code and data are
available at https://github.com/cassidylaidlaw/orpo
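A minimal tabular sketch of the idea: estimate each policy's discounted state occupancy measure from sampled rollouts, then penalize the divergence between the learned policy's measure and the safe policy's measure instead of a per-state KL between action distributions. The tabular setting, function names, and the use of total variation distance are illustrative assumptions; the released ORPO code linked above may estimate and penalize OM divergence differently.

```python
import numpy as np

def empirical_occupancy(rollouts, n_states, gamma=0.99):
    """Estimate a discounted state occupancy measure from sampled rollouts.

    rollouts: list of state-index sequences generated by one policy.
    Returns a length-n_states probability vector.
    """
    d = np.zeros(n_states)
    for states in rollouts:
        for t, s in enumerate(states):
            d[s] += gamma ** t
    return d / d.sum()

def tv_distance(p, q):
    """Total variation distance between two occupancy measures."""
    return 0.5 * np.abs(p - q).sum()

def om_regularized_objective(proxy_return, d_learned, d_safe, lam=1.0):
    """Proxy return penalized by occupancy-measure divergence to the safe policy,
    in place of the usual per-state action-distribution KL penalty."""
    return proxy_return - lam * tv_distance(d_learned, d_safe)
```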
Related papers
- Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification [1.0582505915332336]
We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility.
If error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model.
The pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error.
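A toy simulation (not from the paper) of the qualitative claim, assuming Gaussian versus Cauchy proxy error: when the error is heavy-tailed, aggressively optimizing the proxy selects an enormous error draw rather than genuine utility.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # candidate policies being compared by proxy reward

true_utility = rng.normal(size=n)           # unknown true reward of each candidate
light_error = rng.normal(size=n)            # light-tailed reward misspecification
heavy_error = rng.standard_cauchy(size=n)   # heavy-tailed reward misspecification

for name, err in [("light-tailed", light_error), ("heavy-tailed", heavy_error)]:
    proxy = true_utility + err
    best = int(np.argmax(proxy))            # "optimize the proxy hard"
    print(f"{name}: proxy reward {proxy[best]:.1f}, true utility {true_utility[best]:.2f}")
```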
arXiv Detail & Related papers (2024-07-19T17:57:59Z)
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP)
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
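A minimal sketch of the weight-space merging ingredient, assuming the policies share an architecture and floating-point parameters; WARP itself applies merging at three distinct stages with specific interpolation schemes, so this plain linear average is only illustrative.

```python
import torch

def merge_policies(state_dicts, coeffs=None):
    """Linearly average several policies' parameters in weight space.

    state_dicts: list of state_dicts from policies with identical architectures.
    coeffs: optional mixing coefficients (defaults to a uniform average).
    """
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
    return merged

# usage (hypothetical): policy.load_state_dict(merge_policies([sd_a, sd_b, sd_c]))
```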
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- Safety Margins for Reinforcement Learning [74.13100479426424]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
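One plausible proxy criticality metric, sketched below as an assumption rather than the paper's exact definition, is the spread of the agent's Q-values at a state: if every action scores similarly, deviating from the policy is cheap, whereas a large spread marks a critical state that warrants a wide safety margin.

```python
import numpy as np

def proxy_criticality(q_values):
    """Gap between the best and the average Q-value at a state.

    A small gap suggests deviations from the policy are cheap; a large gap
    suggests the state is critical. This particular metric is illustrative,
    not necessarily the one used in the paper.
    """
    q = np.asarray(q_values, dtype=float)
    return float(q.max() - q.mean())
```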
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
- Provable Defense against Backdoor Policies in Reinforcement Learning [35.908468039596734]
A backdoor policy is a security threat where an adversary publishes a seemingly well-behaved policy that in fact contains hidden triggers the adversary can later activate.
We propose a provable defense mechanism against backdoor policies in reinforcement learning under subspace trigger assumption.
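A rough sketch of one sanitization strategy consistent with the subspace trigger assumption: project each observation onto a low-rank subspace estimated from clean, trigger-free states before handing it to the possibly-backdoored policy. The SVD-based construction and helper names are assumptions here; the paper's actual procedure and guarantees may differ in detail.

```python
import numpy as np

def fit_clean_subspace(clean_states, k):
    """Estimate a rank-k subspace spanned by clean (trigger-free) observations."""
    X = np.asarray(clean_states, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]          # basis rows span the clean subspace

def sanitize(observation, mean, basis):
    """Project an observation onto the clean subspace, removing out-of-subspace triggers."""
    centered = np.asarray(observation, dtype=float) - mean
    return mean + basis.T @ (basis @ centered)
```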
arXiv Detail & Related papers (2022-11-18T23:12:24Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
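A generic contrastive (InfoNCE-style) lower bound on the mutual information between states and actions, shown only to make the regularized quantity concrete; MISA's actual bound and the way it enters policy and critic learning are not reproduced here.

```python
import math
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(state_emb, action_emb):
    """Contrastive lower bound on I(S; A) for a batch of paired (s, a) embeddings.

    state_emb, action_emb: [batch, dim] tensors; matched rows are positive pairs.
    Returns log(batch) minus the cross-entropy of picking the matching action
    embedding for each state, a standard InfoNCE bound on mutual information.
    """
    logits = state_emb @ action_emb.t()          # pairwise similarity scores
    labels = torch.arange(logits.size(0))        # positives lie on the diagonal
    return math.log(logits.size(0)) - F.cross_entropy(logits, labels)
```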
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Defining and Characterizing Reward Hacking [3.385988109683852]
We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return.
In particular, for the set of all policies, two reward functions can only be unhackable if one of them is constant.
Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
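The quoted definition is straightforward to check on a finite set of policies, as in the sketch below (enumerating policy pairs is an illustrative simplification; the paper's results concern the set of all policies).

```python
from itertools import combinations

def is_unhackable(proxy_returns, true_returns):
    """Check unhackability over a finite policy set: increasing the expected
    proxy return must never strictly decrease the expected true return.

    proxy_returns[i], true_returns[i]: expected returns of policy i.
    """
    for i, j in combinations(range(len(proxy_returns)), 2):
        if proxy_returns[i] > proxy_returns[j] and true_returns[i] < true_returns[j]:
            return False
        if proxy_returns[j] > proxy_returns[i] and true_returns[j] < true_returns[i]:
            return False
    return True
```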
arXiv Detail & Related papers (2022-09-27T00:32:44Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
- Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences [33.471102483095315]
We investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values.
We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy.
No significant differences were observed in the discrete-action setting or on a suite of benchmark problems.
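A minimal sketch of the two greedification losses for a discrete-action policy, assuming policy logits and a vector of action values are available; the estimators and training loop from the paper are omitted.

```python
import torch
import torch.nn.functional as F

def forward_kl(policy_logits, q_values, tau=1.0):
    """KL(Boltzmann(Q) || pi): the mode-covering greedification loss."""
    target = F.softmax(q_values / tau, dim=-1)
    log_target = F.log_softmax(q_values / tau, dim=-1)
    log_pi = F.log_softmax(policy_logits, dim=-1)
    return (target * (log_target - log_pi)).sum(dim=-1).mean()

def reverse_kl(policy_logits, q_values, tau=1.0):
    """KL(pi || Boltzmann(Q)): the mode-seeking greedification loss."""
    pi = F.softmax(policy_logits, dim=-1)
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_target = F.log_softmax(q_values / tau, dim=-1)
    return (pi * (log_pi - log_target)).sum(dim=-1).mean()
```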
arXiv Detail & Related papers (2021-07-17T17:09:18Z)
- Adaptive Discretization for Adversarial Lipschitz Bandits [85.39106976861702]
Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces.
A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions.
We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds.
arXiv Detail & Related papers (2020-06-22T16:06:25Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
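One simple way to realize a state-action-dependent allowable deviation, sketched here as an assumption rather than BRPO's exact parameterization: shift the behavior policy's log-probabilities by a learned residual whose magnitude is bounded per state-action pair, then renormalize.

```python
import torch
import torch.nn.functional as F

def residual_policy(behavior_probs, residual_logits, allowed_dev):
    """Perturb the behavior policy by a bounded, state-action-dependent residual.

    behavior_probs:  [batch, n_actions] probabilities of the data-generating policy
    residual_logits: [batch, n_actions] learned residual scores
    allowed_dev:     [batch, n_actions] learned non-negative deviation bounds
    (This parameterization is illustrative, not BRPO's exact form.)
    """
    delta = allowed_dev * torch.tanh(residual_logits)   # bounded log-probability shift
    logits = torch.log(behavior_probs + 1e-8) + delta
    return F.softmax(logits, dim=-1)
```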
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.