Defining and Characterizing Reward Hacking
- URL: http://arxiv.org/abs/2209.13085v1
- Date: Tue, 27 Sep 2022 00:32:44 GMT
- Title: Defining and Characterizing Reward Hacking
- Authors: Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David
Krueger
- Abstract summary: We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return.
In particular, for the set of all policies, two reward functions can only be unhackable if one of them is constant.
Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
- Score: 3.385988109683852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide the first formal definition of reward hacking, a phenomenon where
optimizing an imperfect proxy reward function, $\mathcal{\tilde{R}}$, leads to
poor performance according to the true reward function, $\mathcal{R}$. We say
that a proxy is unhackable if increasing the expected proxy return can never
decrease the expected true return. Intuitively, it might be possible to create
an unhackable proxy by leaving some terms out of the reward function (making it
"narrower") or overlooking fine-grained distinctions between roughly equivalent
outcomes, but we show this is usually not the case. A key insight is that the
linearity of reward (in state-action visit counts) makes unhackability a very
strong condition. In particular, for the set of all stochastic policies, two
reward functions can only be unhackable if one of them is constant. We thus
turn our attention to deterministic policies and finite sets of stochastic
policies, where non-trivial unhackable pairs always exist, and establish
necessary and sufficient conditions for the existence of simplifications, an
important special case of unhackability. Our results reveal a tension between
using reward functions to specify narrow tasks and aligning AI systems with
human values.
Related papers
- Partial Identifiability and Misspecification in Inverse Reinforcement Learning [64.13583792391783]
The aim of Inverse Reinforcement Learning is to infer a reward function $R$ from a policy $pi$.
This paper provides a comprehensive analysis of partial identifiability and misspecification in IRL.
arXiv Detail & Related papers (2024-11-24T18:35:46Z) - (Quantum) Indifferentiability and Pre-Computation [50.06591179629447]
Indifferentiability is a cryptographic paradigm for analyzing the security of ideal objects.
Despite its strength, indifferentiability is not known to offer security against pre-processing attacks.
We propose a strengthening of indifferentiability which is not only composable but also takes arbitrary pre-computation into account.
arXiv Detail & Related papers (2024-10-22T00:41:47Z) - Inception: Efficiently Computable Misinformation Attacks on Markov Games [14.491458698581038]
We study security threats to Markov games due to information asymmetry and misinformation.
We derive the victim's policy under worst-case rationality and present-time algorithms to compute the attacker's optimal worst-case policy.
Our work exposes a security vulnerability from standard game assumptions under misinformation.
arXiv Detail & Related papers (2024-06-24T20:01:43Z) - Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking [11.589217788048964]
We introduce a definition of reward hacking based on the correlation between proxy and true rewards for states.
We show theoretically that regularization to the base policy can effectively prevent reward hacking.
arXiv Detail & Related papers (2024-03-05T18:22:15Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Learning Stationary Nash Equilibrium Policies in $n$-Player Stochastic
Games with Independent Chains [2.132096006921048]
We consider a class of $n$-player games in which players have their own internal state/action spaces while they are coupled through their payoff functions.
For this class of games, we first show that finding a stationary Nash (NE) policy without any assumption on the reward functions is interactable.
We develop algorithms based on dual averaging and dual mirror descent, which converge to the set of $epsilon$-NE policies.
arXiv Detail & Related papers (2022-01-28T16:27:21Z) - Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z) - The Effects of Reward Misspecification: Mapping and Mitigating
Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z) - Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z) - Quantifying Differences in Reward Functions [24.66221171351157]
We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly.
We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.
arXiv Detail & Related papers (2020-06-24T17:35:15Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $tildemathcalO(S2Amathrmpoly(H)/epsilon2)$ episodes of exploration.
We also give a nearly-matching $Omega(S2AH2/epsilon2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.