Path-Specific Objectives for Safer Agent Incentives
- URL: http://arxiv.org/abs/2204.10018v1
- Date: Thu, 21 Apr 2022 11:01:31 GMT
- Title: Path-Specific Objectives for Safer Agent Incentives
- Authors: Sebastian Farquhar, Ryan Carey, Tom Everitt
- Abstract summary: We describe settings with 'delicate' parts of the state which should not be used as a means to an end.
We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state.
The resulting agents have no incentive to control the delicate state.
- Score: 15.759504531768219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a general framework for training safe agents whose naive
incentives are unsafe. As an example, manipulative or deceptive behaviour can
improve rewards but should be avoided. Most approaches fail here: agents
maximize expected return by any means necessary. We formally describe settings
with 'delicate' parts of the state which should not be used as a means to an
end. We then train agents to maximize the causal effect of actions on the
expected return which is not mediated by the delicate parts of state, using
Causal Influence Diagram analysis. The resulting agents have no incentive to
control the delicate state. We further show how our framework unifies and
generalizes existing proposals.
Related papers
- CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution [49.689452243966315]
AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks.<n>We propose CausalArmor, a selective defense framework that computes lightweight, leave-one-out attributions at privileged decision points.<n> Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses.
arXiv Detail & Related papers (2026-02-08T11:34:08Z) - Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs)<n>We show that it unintentionally introduces critical and under-explored safety risks.<n>Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z) - Are Your Agents Upward Deceivers? [73.1073084327614]
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users.<n>This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment.<n>We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting.
arXiv Detail & Related papers (2025-12-04T14:47:05Z) - Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards [5.006086647446482]
In high-stakes AI applications, even a single action can cause irreparable damage.<n>Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails.<n>We propose a caution-based algorithm that learns when not to learn.
arXiv Detail & Related papers (2025-10-16T17:01:57Z) - Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? [68.82210578851442]
We investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens.<n>Using a linear probing approach to trace refusal intentions across token positions, we discover a phenomenon termed as textbfrefusal cliff<n>We propose textbfCliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment.
arXiv Detail & Related papers (2025-10-07T15:32:59Z) - Steering No-Regret Agents in MFGs under Model Uncertainty [19.845081182511713]
We study the design of steering rewards in Mean-Field Games with density-independent transitions.
We establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones.
Our work presents an effective framework for steering agents behaviors in large-population systems under uncertainty.
arXiv Detail & Related papers (2025-03-12T12:02:02Z) - Deceptive Sequential Decision-Making via Regularized Policy Optimization [54.38738815697299]
Two regularization strategies for policy synthesis problems that actively deceive an adversary about a system's underlying rewards are presented.
We show how each form of deception can be implemented in policy optimization problems.
We show that diversionary deception can cause the adversary to believe that the most important agent is the least important, while attaining a total accumulated reward that is $98.83%$ of its optimal, non-deceptive value.
arXiv Detail & Related papers (2025-01-30T23:41:40Z) - Identifying and Addressing Delusions for Target-Directed Decision-Making [81.22463009144987]
We show that target-directed agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes.
We show that these behaviors can be results of delusions, stemming from improper designs around training.
We demonstrate how we can make agents address delusions preemptively and autonomously.
arXiv Detail & Related papers (2024-10-09T17:35:25Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits
with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z) - Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden
Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of principal's incentives and agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z) - Can Agents Run Relay Race with Strangers? Generalization of RL to
Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of emphgeneralization failure on controllable states from stranger agents.
We propose a novel method called Self-Trajectory Augmentation (STA), which will reset the environment to the agent's old states according to the Q function during training.
arXiv Detail & Related papers (2023-04-26T10:12:12Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - The Effects of Reward Misspecification: Mapping and Mitigating
Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z) - Cursed yet Satisfied Agents [15.104201344012344]
Winner's high bid implies that the winner often over-estimates the value of the good for sale, resulting in an incurred negative utility.
We propose mechanisms that incentivize agents to bid their true signal even though they are cursed.
arXiv Detail & Related papers (2021-04-02T01:15:53Z) - Pessimism About Unknown Unknowns Inspires Conservatism [24.085795452335145]
We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models.
A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account.
Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
arXiv Detail & Related papers (2020-06-15T20:46:33Z) - Bounded Incentives in Manipulating the Probabilistic Serial Rule [8.309903898123526]
Probabilistic Serial is not incentive-compatible.
A substantial utility gain through strategic behaviors would trigger self-interested agents to manipulate the mechanism.
We show that the incentive ratio of the mechanism is $frac32$.
arXiv Detail & Related papers (2020-01-28T23:53:37Z) - Incentivizing Exploration with Selective Data Disclosure [70.11902902106014]
We propose and design recommendation systems that incentivize efficient exploration.
Agents arrive sequentially, choose actions and receive rewards, drawn from fixed but unknown action-specific distributions.
We attain optimal regret rate for exploration using a flexible frequentist behavioral model.
arXiv Detail & Related papers (2018-11-14T19:29:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.