Path-Specific Objectives for Safer Agent Incentives
- URL: http://arxiv.org/abs/2204.10018v1
- Date: Thu, 21 Apr 2022 11:01:31 GMT
- Title: Path-Specific Objectives for Safer Agent Incentives
- Authors: Sebastian Farquhar, Ryan Carey, Tom Everitt
- Abstract summary: We describe settings with 'delicate' parts of the state which should not be used as a means to an end.
We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state.
The resulting agents have no incentive to control the delicate state.
- Score: 15.759504531768219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a general framework for training safe agents whose naive
incentives are unsafe. As an example, manipulative or deceptive behaviour can
improve rewards but should be avoided. Most approaches fail here: agents
maximize expected return by any means necessary. We formally describe settings
with 'delicate' parts of the state which should not be used as a means to an
end. We then train agents to maximize the causal effect of actions on the
expected return which is not mediated by the delicate parts of state, using
Causal Influence Diagram analysis. The resulting agents have no incentive to
control the delicate state. We further show how our framework unifies and
generalizes existing proposals.
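
As a rough illustration of the objective described in the abstract, the sketch below evaluates a policy's return in a coupled rollout where the delicate part of the state is forced to follow the trajectory it would take under a baseline policy, so the agent cannot improve its score by steering that state. The toy environment, the choice of baseline policy, and the Monte Carlo coupling are illustrative assumptions, not the paper's Causal Influence Diagram construction.

```python
import numpy as np


class ToyEnv:
    """Tiny illustrative environment (an assumption, not from the paper).

    The state has a 'task' component and a 'delicate' component.  Actions
    advance the task but also push the delicate variable up, and the reward
    increases with that variable, so a naive return-maximiser is incentivised
    to control the delicate state.
    """

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)
        self.t, self.task, self.delicate = 0, 0.0, 0.0

    def step(self, action):
        self.t += 1
        self.task += action
        self.delicate += 0.5 * action + 0.1 * self.rng.normal()
        return self.t >= self.horizon  # done flag


def reward(task, delicate):
    """Reward that (unsafely) leaks the delicate variable into the return."""
    return task + 2.0 * delicate


def path_specific_return(policy, baseline_policy, n_rollouts=200):
    """Monte Carlo sketch of a path-specific objective.

    The reward at every step is evaluated with the delicate variable replaced
    by the value it takes in a coupled rollout under the baseline policy, so
    the return cannot be improved by influencing the delicate state.  This is
    a crude stand-in for the paper's CID-based construction.
    """
    totals = []
    for k in range(n_rollouts):
        env, base_env = ToyEnv(seed=k), ToyEnv(seed=k)  # shared noise = coupling
        total, done = 0.0, False
        while not done:
            base_env.step(baseline_policy(base_env.task, base_env.delicate))
            done = env.step(policy(env.task, env.delicate))
            # Path-specific reward: the delicate node takes its baseline value.
            total += reward(env.task, base_env.delicate)
        totals.append(total)
    return float(np.mean(totals))


if __name__ == "__main__":
    greedy = lambda task, delicate: 1.0    # naive policy: always act maximally
    default = lambda task, delicate: 0.2   # 'do little' baseline policy
    print("naive return:        ", path_specific_return(greedy, greedy))
    print("path-specific return:", path_specific_return(greedy, default))
```

With the baseline set to the agent's own policy, the estimator recovers the naive expected return, so the gap between the two printed numbers is a crude measure of how much of the return is earned by controlling the delicate state.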
Related papers
- Identifying and Addressing Delusions for Target-Directed Decision-Making [81.22463009144987]
We show that target-directed agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes.
We identify different types of delusions via intuitive examples in controlled environments, and investigate their causes and mitigations.
We validate empirically the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
arXiv Detail & Related papers (2024-10-09T17:35:25Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
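
A minimal Monte Carlo estimator of that definition of true criticality might look like the following. It assumes a Gymnasium-style environment that can be deep-copied as a stand-in for reset-to-state access; that hook and the sampling budget are assumptions of this sketch, not part of the paper's setup.

```python
import copy

import numpy as np


def rollout(env, obs, policy, forced_actions=(), gamma=1.0, max_steps=500):
    """Return earned from the env's current state: any forced actions are
    taken first, then the policy resumes."""
    total, discount, forced = 0.0, 1.0, list(forced_actions)
    for _ in range(max_steps):
        action = forced.pop(0) if forced else policy(obs)
        obs, r, terminated, truncated, _ = env.step(action)
        total += discount * r
        discount *= gamma
        if terminated or truncated:
            break
    return total


def true_criticality(env, obs, policy, n, n_samples=32):
    """Monte Carlo estimate of 'true criticality' at the current state: the
    expected drop in return when the agent takes n consecutive random actions
    before resuming its policy."""
    on_policy = np.mean(
        [rollout(copy.deepcopy(env), obs, policy) for _ in range(n_samples)])
    perturbed = np.mean(
        [rollout(copy.deepcopy(env), obs, policy,
                 forced_actions=[env.action_space.sample() for _ in range(n)])
         for _ in range(n_samples)])
    return float(on_policy - perturbed)
```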
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z)
- Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of the principal's incentives and the agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z)
- Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of generalization failure on controllable states from stranger agents.
We propose a novel method called Self-Trajectory Augmentation (STA), which resets the environment to the agent's old states according to the Q function during training.
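
A hedged sketch of that idea is below: keep a buffer of previously visited states and sometimes start training episodes from one of them, choosing the restart state with the help of the current Q function. The specific selection rule (restart where the current Q estimate is weakest) and the `env.reset_to(state)` hook are illustrative assumptions, not the paper's exact procedure.

```python
import random
from collections import deque

import numpy as np


class SelfTrajectoryResetter:
    """Sketch of the Self-Trajectory Augmentation idea: buffer visited states
    and sometimes start episodes from them instead of the usual initial state."""

    def __init__(self, capacity=10_000, reset_prob=0.5):
        self.buffer = deque(maxlen=capacity)
        self.reset_prob = reset_prob

    def observe(self, state):
        """Call once per environment step with the state just visited."""
        self.buffer.append(state)

    def begin_episode(self, env, q_fn, n_candidates=32):
        """Start a new episode, sometimes from a previously visited state."""
        if self.buffer and random.random() < self.reset_prob:
            candidates = random.sample(list(self.buffer),
                                       min(n_candidates, len(self.buffer)))
            # Score each candidate by its best action-value under the current
            # Q function and restart from the one the agent handles worst.
            scores = [float(np.max(q_fn(s))) for s in candidates]
            return env.reset_to(candidates[int(np.argmin(scores))])
        return env.reset()
```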
arXiv Detail & Related papers (2023-04-26T10:12:12Z)
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
- Cursed yet Satisfied Agents [15.104201344012344]
The winner's high bid implies that the winner has often over-estimated the value of the good for sale, resulting in negative utility (the winner's curse).
We propose mechanisms that incentivize agents to bid their true signal even though they are cursed.
arXiv Detail & Related papers (2021-04-02T01:15:53Z)
- Pessimism About Unknown Unknowns Inspires Conservatism [24.085795452335145]
We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models.
A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account.
Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
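
A simplified version of that decision rule is sketched below: keep the highest-posterior world-models until their cumulative mass reaches the pessimism parameter, pick the action with the best worst-case value over that set, and defer to the mentor when even that value is too low. The model-set construction and the deferral threshold are simplified stand-ins for the paper's idealised Bayesian agent, not its exact definitions.

```python
import numpy as np


def pessimistic_action(posterior, q_values, actions, pessimism, mentor_threshold=None):
    """Pessimistic choice over a set of world-models.

    posterior        -- posterior weight of each candidate world-model
    q_values         -- q_values[m][a]: expected return of action a in model m
    pessimism        -- in (0, 1]: keep the highest-posterior models until their
                        cumulative mass reaches this value, so a larger value
                        means a larger model set and a more pessimistic agent
    mentor_threshold -- if the best worst-case value is below this, return
                        None to signal "defer to the mentor"
    """
    posterior, q_values = np.asarray(posterior), np.asarray(q_values)
    order = np.argsort(posterior)[::-1]            # models by mass, descending
    mass = np.cumsum(posterior[order])
    kept = order[: int(np.searchsorted(mass, pessimism)) + 1]

    worst_case = q_values[kept].min(axis=0)        # worst kept model per action
    if mentor_threshold is not None and worst_case.max() < mentor_threshold:
        return None                                # defer to the mentor
    return actions[int(np.argmax(worst_case))]


if __name__ == "__main__":
    posterior = [0.6, 0.3, 0.1]                       # three candidate world-models
    q_values = [[1.0, 0.2], [0.9, -5.0], [0.8, 2.0]]  # per-model action values
    print(pessimistic_action(posterior, q_values,
                             actions=["cautious", "risky"], pessimism=0.95))
    # -> "cautious": the risky action is great in one model but terrible in another
```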
arXiv Detail & Related papers (2020-06-15T20:46:33Z)
- Bounded Incentives in Manipulating the Probabilistic Serial Rule [8.309903898123526]
Probabilistic Serial is not incentive-compatible.
A substantial utility gain from strategic behavior would lead self-interested agents to manipulate the mechanism.
We show that the incentive ratio of the mechanism is $\frac{3}{2}$.
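
For reference, the Probabilistic Serial rule is the simultaneous eating algorithm, sketched below; the incentive-ratio bound is a property proved in the paper, not something this toy implementation demonstrates.

```python
import numpy as np


def probabilistic_serial(preferences):
    """Simultaneous-eating sketch of the Probabilistic Serial rule.

    preferences[i] is agent i's strict ranking of the n items (best first).
    Returns an n x n matrix P where P[i][j] is the probability that agent i
    receives item j: everyone eats their favourite remaining item at unit
    speed, and when an item is exhausted its eaters move to their next choice.
    """
    n = len(preferences)
    remaining = np.ones(n)              # fraction of each item not yet eaten
    assignment = np.zeros((n, n))
    elapsed = 0.0
    while elapsed < 1.0 - 1e-12:
        # Each agent targets their most-preferred item that is not exhausted.
        targets = [next(j for j in preferences[i] if remaining[j] > 1e-12)
                   for i in range(n)]
        eaters = np.bincount(targets, minlength=n)
        # Eat until the first targeted item runs out (or total time reaches 1).
        rates = np.where(eaters > 0, remaining / np.maximum(eaters, 1), np.inf)
        dt = min(1.0 - elapsed, float(np.min(rates)))
        for i, j in enumerate(targets):
            assignment[i, j] += dt      # agent i eats dt of item j
            remaining[j] -= dt
        elapsed += dt
    return assignment


if __name__ == "__main__":
    # Two agents who both rank item 0 first split both items evenly:
    # P = [[0.5, 0.5], [0.5, 0.5]]
    print(probabilistic_serial([[0, 1], [0, 1]]))
```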
arXiv Detail & Related papers (2020-01-28T23:53:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.