Power-seeking can be probable and predictive for trained agents
- URL: http://arxiv.org/abs/2304.06528v1
- Date: Thu, 13 Apr 2023 13:29:01 GMT
- Title: Power-seeking can be probable and predictive for trained agents
- Authors: Victoria Krakovna and Janos Kramar
- Abstract summary: Power-seeking behavior is a key source of risk from advanced AI.
We investigate how the training process affects power-seeking incentives.
We show that power-seeking incentives can be probable and predictive.
- Score: 3.616948583169635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Power-seeking behavior is a key source of risk from advanced AI, but our
theoretical understanding of this phenomenon is relatively limited. Building on
existing theoretical results demonstrating power-seeking incentives for most
reward functions, we investigate how the training process affects power-seeking
incentives and show that they are still likely to hold for trained agents under
some simplifying assumptions. We formally define the training-compatible goal
set (the set of goals consistent with the training rewards) and assume that the
trained agent learns a goal from this set. In a setting where the trained agent
faces a choice to shut down or avoid shutdown in a new situation, we prove that
the agent is likely to avoid shutdown. Thus, we show that power-seeking
incentives can be probable (likely to arise for trained agents) and predictive
(allowing us to predict undesirable behavior in new situations).
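A hedged toy sketch of the "probable" claim (an illustrative assumption, not the paper's formal MDP or proof): suppose the outcomes reachable in the new situation were never rewarded during training, so a goal drawn from the training-compatible set assigns them independent uniform rewards; avoiding shutdown is then preferred whenever any still-reachable outcome scores above the shutdown outcome, which happens with probability k/(k+1) for k options.
```python
import random

def prob_avoid_shutdown(num_options: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of how often a goal drawn from a toy
    training-compatible set prefers avoiding shutdown (uniform-reward toy model)."""
    avoided = 0
    for _ in range(trials):
        r_shutdown = random.random()                                # reward of shutting down
        r_options = [random.random() for _ in range(num_options)]   # rewards reachable by staying on
        if max(r_options) > r_shutdown:                             # this goal prefers staying operational
            avoided += 1
    return avoided / trials

if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(f"{k} options: P(avoid shutdown) ~ {prob_avoid_shutdown(k):.3f}"
              f" (analytic {k / (k + 1):.3f})")
```
The sketch only shows that keeping options open wins for most compatible goals in this toy model; the paper's result is proved under its own stated assumptions.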
Related papers
- Getting By Goal Misgeneralization With a Little Help From a Mentor [5.012314384895538]
This paper explores whether allowing an agent to ask a supervisor for help in unfamiliar situations can mitigate goal misgeneralization.
We focus on agents trained with PPO in the CoinRun environment, a setting known to exhibit goal misgeneralization.
We find that methods based on the agent's internal state fail to proactively request help, instead waiting until mistakes have already occurred.
arXiv Detail & Related papers (2024-10-28T14:07:41Z)
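For the internal-state baselines mentioned in the entry above, a minimal hypothetical sketch of the kind of trigger such methods use is an entropy threshold on the policy's action distribution; the names and threshold below are illustrative assumptions, not the paper's implementation.
```python
import math

def policy_entropy(action_probs):
    """Shannon entropy of the agent's action distribution (its 'internal state')."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

def ask_for_help(action_probs, entropy_threshold=1.0):
    """Hypothetical internal-state trigger: request supervisor help when the policy
    is uncertain. As the entry above notes, such reactive triggers tend to fire
    only after the agent is already off-distribution."""
    return policy_entropy(action_probs) > entropy_threshold

# Example: a confident policy does not ask; a near-uniform one does.
print(ask_for_help([0.9, 0.05, 0.05]))   # False
print(ask_for_help([0.3, 0.4, 0.3]))     # True
```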
- Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning [16.350898218047405]
We propose a new class of backdoor attacks against Deep Reinforcement Learning (DRL) algorithms.
These attacks achieve state-of-the-art performance while minimally altering the agent's rewards.
We then devise an online attack that significantly outperforms prior attacks under bounded reward constraints.
arXiv Detail & Related papers (2024-10-17T19:50:28Z)
- Performative Prediction on Games and Mechanism Design [69.7933059664256]
We study a collective risk dilemma where agents decide whether to trust predictions based on past accuracy.
As predictions shape collective outcomes, social welfare arises naturally as a metric of concern.
We show how to achieve better trade-offs and use them for mechanism design.
arXiv Detail & Related papers (2024-08-09T16:03:44Z)
- Parametrically Retargetable Decision-Makers Tend To Seek Power [91.93765604105025]
In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive.
We consider a range of models of AI decision-making, from optimal, to random, to choices informed by learning and interacting with an environment.
We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power.
arXiv Detail & Related papers (2022-06-27T17:39:23Z)
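Complementing the sketch after the main abstract, a hedged toy illustration of the retargetability claim above: under i.i.d. random rewards standing in for "most reward functions", both an optimizing rule and a Boltzmann-sampling rule end up in the option-rich ("stay alive") branch roughly in proportion to its share of outcomes. The environment and parameters are assumptions for illustration, not the paper's formal setup.
```python
import math
import random

# Toy two-branch environment (an illustrative assumption, not the paper's formalism):
# the "shutdown" branch reaches 1 terminal outcome, the "alive" branch reaches 4.
SHUTDOWN_OUTCOMES = 1
ALIVE_OUTCOMES = 4

def optimal_choice(r_shutdown, r_alive):
    """Optimizing rule: take the branch whose best reachable reward is larger."""
    return "alive" if max(r_alive) > max(r_shutdown) else "shutdown"

def boltzmann_choice(r_shutdown, r_alive, temp=1.0):
    """Boltzmann rule: sample a terminal outcome with probability proportional
    to exp(reward / temp), then report which branch it lies on."""
    outcomes = [("shutdown", r) for r in r_shutdown] + [("alive", r) for r in r_alive]
    weights = [math.exp(r / temp) for _, r in outcomes]
    branch, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return branch

def fraction_choosing_alive(rule, trials=50_000):
    """Fraction of randomly drawn reward functions under which `rule` stays alive."""
    hits = 0
    for _ in range(trials):
        r_shutdown = [random.random() for _ in range(SHUTDOWN_OUTCOMES)]
        r_alive = [random.random() for _ in range(ALIVE_OUTCOMES)]
        if rule(r_shutdown, r_alive) == "alive":
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # Both rules favor the branch with more reachable outcomes (about 4/5 here).
    print("optimal  :", fraction_choosing_alive(optimal_choice))
    print("boltzmann:", fraction_choosing_alive(boltzmann_choice))
```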
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities.
We additionally present PUP, a new challenging real-world autonomous driving dataset.
We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
arXiv Detail & Related papers (2021-04-26T10:28:34Z)
- Learning to Incentivize Other Learning Agents [73.03133692589532]
We show how to equip RL agents with the ability to give rewards directly to other agents, using a learned incentive function.
Such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games.
Our work points toward more opportunities and challenges along the path to ensure the common good in a multi-agent future.
arXiv Detail & Related papers (2020-06-10T20:12:38Z)
- Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent [21.548271801592907]
Reinforcement learners are agents that learn to pick actions that lead to high reward.
We show that if an agent is guaranteed to be "asymptotically optimal" in any environment, then subject to an assumption about the true environment, this agent will be either "destroyed" or "incapacitated."
We present an agent, Mentee, with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration.
arXiv Detail & Related papers (2020-06-05T10:42:29Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
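The exact-error claim in the entry above can be previewed with a standard identity (a hedged sketch, not necessarily the paper's derivation): for a true state distribution p and an agent belief b, the expected log-likelihood prediction reward satisfies E_{s~p}[log b(s)] = -H(p) - KL(p || b), so it equals negative entropy exactly when the belief is calibrated. The numbers below are hypothetical and only check the identity numerically.
```python
import math

# Hypothetical true state distribution p and agent belief b over three states.
p = [0.6, 0.3, 0.1]
b = [0.5, 0.3, 0.2]

# Expected prediction reward: average log-likelihood the belief assigns to the true state.
expected_prediction_reward = sum(pi * math.log(bi) for pi, bi in zip(p, b))

# Negative entropy of the true distribution and the KL gap to the belief.
neg_entropy = sum(pi * math.log(pi) for pi in p)
kl_gap = sum(pi * math.log(pi / bi) for pi, bi in zip(p, b))

# The two sides agree up to floating-point error; the gap vanishes when b == p.
print(expected_prediction_reward)
print(neg_entropy - kl_gap)
```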
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.