Related papers: Path-Specific Objectives for Safer Agent Incentives

Path-Specific Objectives for Safer Agent Incentives

URL: http://arxiv.org/abs/2204.10018v1
Date: Thu, 21 Apr 2022 11:01:31 GMT
Title: Path-Specific Objectives for Safer Agent Incentives
Authors: Sebastian Farquhar, Ryan Carey, Tom Everitt
Abstract summary: We describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state. The resulting agents have no incentive to control the delicate state.
Score: 15.759504531768219
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state, using Causal Influence Diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.

Related papers

Steering No-Regret Agents in MFGs under Model Uncertainty [19.845081182511713]
We study the design of steering rewards in Mean-Field Games with density-independent transitions. We establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones. Our work presents an effective framework for steering agents behaviors in large-population systems under uncertainty.
arXiv Detail & Related papers (2025-03-12T12:02:02Z)
Deceptive Sequential Decision-Making via Regularized Policy Optimization [54.38738815697299]
Two regularization strategies for policy synthesis problems that actively deceive an adversary about a system's underlying rewards are presented. We show how each form of deception can be implemented in policy optimization problems. We show that diversionary deception can cause the adversary to believe that the most important agent is the least important, while attaining a total accumulated reward that is $98.83%$ of its optimal, non-deceptive value.
arXiv Detail & Related papers (2025-01-30T23:41:40Z)
Identifying and Addressing Delusions for Target-Directed Decision-Making [81.22463009144987]
We show that target-directed agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can be results of delusions, stemming from improper designs around training. We demonstrate how we can make agents address delusions preemptively and autonomously.
arXiv Detail & Related papers (2024-10-09T17:35:25Z)
Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem. Specifically, the arms are strategic agents who can improve their rewards or absorb them. We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z)
Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents. This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal. We introduce an estimator whose only input is the history of principal's incentives and agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z)
Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of emphgeneralization failure on controllable states from stranger agents. We propose a novel method called Self-Trajectory Augmentation (STA), which will reset the environment to the agent's old states according to the Q function during training.
arXiv Detail & Related papers (2023-04-26T10:12:12Z)
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios. We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z)
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
Cursed yet Satisfied Agents [15.104201344012344]
Winner's high bid implies that the winner often over-estimates the value of the good for sale, resulting in an incurred negative utility. We propose mechanisms that incentivize agents to bid their true signal even though they are cursed.
arXiv Detail & Related papers (2021-04-02T01:15:53Z)
Pessimism About Unknown Unknowns Inspires Conservatism [24.085795452335145]
We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models. A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account. Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
arXiv Detail & Related papers (2020-06-15T20:46:33Z)
Bounded Incentives in Manipulating the Probabilistic Serial Rule [8.309903898123526]
Probabilistic Serial is not incentive-compatible. A substantial utility gain through strategic behaviors would trigger self-interested agents to manipulate the mechanism. We show that the incentive ratio of the mechanism is $frac32$.
arXiv Detail & Related papers (2020-01-28T23:53:37Z)
Incentivizing Exploration with Selective Data Disclosure [70.11902902106014]
We propose and design recommendation systems that incentivize efficient exploration. Agents arrive sequentially, choose actions and receive rewards, drawn from fixed but unknown action-specific distributions. We attain optimal regret rate for exploration using a flexible frequentist behavioral model.
arXiv Detail & Related papers (2018-11-14T19:29:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.