On Corrigibility and Alignment in Multi Agent Games
- URL: http://arxiv.org/abs/2501.05360v1
- Date: Thu, 09 Jan 2025 16:44:38 GMT
- Title: On Corrigibility and Alignment in Multi Agent Games
- Authors: Edmund Dable-Heath, Boyko Vodenicharski, James Bishop
- Abstract summary: Corrigibility of autonomous agents is an under-explored part of system design.
It has been suggested that uncertainty over human preferences acts to keep agents corrigible, even in the face of human irrationality.
We present a general framework for modelling corrigibility in a multi-agent setting as a two-player game in which the agents always have a move allowing them to ask the human for supervision.
- Abstract: Corrigibility of autonomous agents is an under-explored part of system design, with previous work focusing on single-agent systems. It has been suggested that uncertainty over human preferences acts to keep agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi-agent setting as a two-player game in which the agents always have a move allowing them to ask the human for supervision. This is formulated as a Bayesian game in order to introduce uncertainty over beliefs about the human. We further analyse two specific cases. First, a two-player corrigibility game, in which we want corrigibility displayed by both agents, for both common-payoff (monotone) games and harmonic games. Second, an adversarial setting, in which one agent is a 'defending' agent and the other an 'adversary'. A general result is provided for the beliefs over the games and over human rationality that the defending agent must hold in order to induce corrigibility.
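As a rough illustration of the framework the abstract sketches (not the authors' exact formulation), the "ask for supervision" move pays off precisely because the agent is uncertain which utility function the human actually holds. A minimal Python sketch, with all type names and payoff numbers invented:

```python
# Hypothetical stand-in for a Bayesian game's belief over human types:
# the agent is unsure which of two utility functions the human holds.
utilities = {
    "type_A": {"act_1": 1.0, "act_2": -2.0},
    "type_B": {"act_1": -2.0, "act_2": 1.0},
}
belief = {"type_A": 0.5, "type_B": 0.5}  # agent's prior over human types
ask_cost = 0.1                           # small cost of asking for supervision

def expected_value(action):
    """Expected payoff of acting unilaterally under the current belief."""
    return sum(p * utilities[t][action] for t, p in belief.items())

# Value of the 'ask' move: the human reveals their type and the agent
# then plays the best action for that type (minus the asking cost).
value_ask = sum(
    p * max(utilities[t].values()) for t, p in belief.items()
) - ask_cost
value_act = max(expected_value(a) for a in ("act_1", "act_2"))

print(f"best unilateral action: {value_act:+.2f}")  # -0.50
print(f"asking the human:       {value_ask:+.2f}")  # +0.90
```

With a symmetric 50/50 belief, asking dominates acting unilaterally; under a sharp enough belief the inequality can flip, which is the flavour of the paper's question about which beliefs induce corrigibility.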
Related papers
- Safe Exploitative Play with Untrusted Type Beliefs [21.177698937011183]
We study the idea of controlling a single agent in a system composed of multiple agents with unknown behaviors, given a set of possible types.
Type beliefs are often learned from past actions and are likely to be incorrect.
We define a tradeoff between risk and opportunity by comparing the payoff obtained against the optimal payoff.
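As a toy illustration of that tradeoff (type names and payoffs invented here), the cost of an incorrect type belief can be read off as the gap between the payoff actually obtained and the payoff an omniscient best response would have earned:

```python
import numpy as np

# Our agent's payoff per action against each possible opponent type.
payoff = {
    "aggressive": np.array([3.0, -1.0]),
    "passive":    np.array([1.0,  2.0]),
}

believed, actual = "aggressive", "passive"  # an incorrect type belief

br = int(np.argmax(payoff[believed]))       # best response to the belief
realized = payoff[actual][br]               # payoff actually obtained
optimal = payoff[actual].max()              # payoff against the true type

print(f"risk incurred by the wrong belief: {optimal - realized:.1f}")
```

Exploiting the believed type is the "opportunity"; the gap printed above is the "risk" paid when the belief turns out to be wrong.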
arXiv Detail & Related papers (2024-11-12T09:49:16Z)
- Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
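A minimal sketch of this kind of convergence (multiplicative-weights self-play on a zero-sum matrix game, not the paper's LLM setup): the time-averaged strategies of the two sides approach a Nash equilibrium even though the iterates themselves keep cycling.

```python
import numpy as np

# Matching pennies as a stand-in for the adversarial/defensive game;
# its unique Nash equilibrium is (0.5, 0.5) for both players.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])  # row player's payoff; column player gets -A

eta, T = 0.05, 5000
sx = np.array([0.0, 0.5])   # log-weights, started away from equilibrium
sy = np.array([0.5, 0.0])
avg_x, avg_y = np.zeros(2), np.zeros(2)

for _ in range(T):
    x = np.exp(sx - sx.max()); x /= x.sum()
    y = np.exp(sy - sy.max()); y /= y.sum()
    avg_x += x
    avg_y += y
    sx += eta * (A @ y)     # defender ascends on x^T A y
    sy -= eta * (A.T @ x)   # adversary descends on it

print("defender average strategy: ", np.round(avg_x / T, 3))
print("adversary average strategy:", np.round(avg_y / T, 3))
```

Both averages tend toward (0.5, 0.5), mirroring the convergence-to-Nash claim in the abstract.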
arXiv Detail & Related papers (2024-06-16T15:24:50Z)
- Cooperation and Control in Delegation Games [1.3518297878940662]
We study multi-principal, multi-agent scenarios as delegation games.
In such games, there are two important failure modes: problems of control and problems of cooperation.
We show -- theoretically and empirically -- how measures of control and cooperation determine the principals' welfare.
arXiv Detail & Related papers (2024-02-24T14:17:41Z)
- Toward Human-AI Alignment in Large-Scale Multi-Player Games [24.784173202415687]
We analyze extensive human gameplay data from Xbox's Bleeding Edge (100K+ games)
We find that while human players exhibit variability in fight-flight and explore-exploit behavior, AI players tend towards uniformity.
These stark differences underscore the need for interpretable evaluation, design, and integration of AI in human-aligned applications.
arXiv Detail & Related papers (2024-02-05T22:55:33Z)
- Game-theoretic Objective Space Planning [4.989480853499916]
Understanding intent of other agents is crucial to deploying autonomous systems in adversarial multi-agent environments.
Current approaches either oversimplify the discretization of the action space of agents or fail to recognize the long-term effect of actions and become myopic.
We propose a novel dimension reduction method that encapsulates diverse agent behaviors while conserving the continuity of agent actions.
arXiv Detail & Related papers (2022-09-16T07:35:20Z)
- Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
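A schematic of that trade-off (all numbers invented; this is the general shape of the idea, not the paper's exact objective): each action is scored by its proxy reward minus a penalty for reducing the value the agent could attain on a set of hypothetical future tasks.

```python
proxy_reward = {"careful": 0.8, "destructive": 1.0}

# Attainable value on hypothetical future tasks after each action.
future_value = {
    "careful":     {"task_1": 1.0, "task_2": 1.0},
    "destructive": {"task_1": 0.1, "task_2": 0.2},
}
baseline = {"task_1": 1.0, "task_2": 1.0}  # value if the agent stays passive
lam = 0.5                                  # strength of the regularizer

def score(action):
    penalty = sum(
        max(0.0, baseline[t] - future_value[action][t]) for t in baseline
    )
    return proxy_reward[action] - lam * penalty

print({a: round(score(a), 2) for a in proxy_reward})
# {'careful': 0.8, 'destructive': 0.15}: the destructive action wins on
# proxy reward alone, but the future-task penalty flips the choice.
```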
arXiv Detail & Related papers (2022-06-23T16:36:13Z)
- How and Why to Manipulate Your Own Agent [5.634825161148484]
We consider strategic settings where several users engage in a repeated online interaction, assisted by regret-minimizing agents that repeatedly play a "game" on their behalf.
We study the dynamics and average outcomes of the repeated game of the agents, and view it as inducing a meta-game between the users.
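A sketch of the kind of agent being delegated to (regret matching over two actions; the game and opponent here are invented): a user who knows the agent's update rule can shape its long-run play by choosing what payoffs to report, which is the meta-game the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# payoff[i, j]: the user's reported payoff for our action i vs opponent j.
payoff = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

regret = np.zeros(2)
strategy = np.full(2, 0.5)
for _ in range(10_000):
    pos = np.maximum(regret, 0.0)
    strategy = pos / pos.sum() if pos.sum() > 0 else np.full(2, 0.5)
    i = rng.choice(2, p=strategy)
    j = rng.integers(2)                    # stand-in opponent move
    regret += payoff[:, j] - payoff[i, j]  # counterfactual regret update

print("long-run mixed strategy:", np.round(strategy, 2))
```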
arXiv Detail & Related papers (2021-12-14T18:35:32Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
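A toy numerical sketch of the competing objective (visitation counts invented): "surprise" here stands in for the entropy of the agent's experienced observations, which one policy is rewarded for raising and the other for suppressing.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of an empirical state-visitation distribution."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

explorer_counts = np.array([5, 6, 4, 5])     # visits spread over states
controller_counts = np.array([18, 1, 1, 0])  # visits pinned to one state

# The zero-sum surprise game: the explorer's reward is the entropy it
# induces; the controller's reward is the negative of the entropy it allows.
print(f"explorer reward:   {entropy(explorer_counts):+.2f}")
print(f"controller reward: {-entropy(controller_counts):+.2f}")
```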
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- End-to-End Learning and Intervention in Games [60.41921763076017]
We provide a unified framework for learning and intervention in games.
We propose two approaches, respectively based on explicit and implicit differentiation.
The analytical results are validated using several real-world problems.
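As a minimal sketch of the implicit-differentiation route (a made-up quadratic game, not the paper's formulation): when the equilibrium solves a linear system A x = b(theta), the derivative of the equilibrium with respect to an intervention parameter theta follows from the implicit function theorem, with no need to unroll a solver.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 2.0]])  # stacked first-order conditions of both players

def b(theta):
    # Invented linear dependence of the players' incentives on the
    # intervention parameter theta.
    return np.array([1.0 + theta, 2.0 - theta])

theta = 0.3
x_star = np.linalg.solve(A, b(theta))    # equilibrium at theta

db = np.array([1.0, -1.0])               # d b / d theta
dx = np.linalg.solve(A, db)              # implicit gradient: A dx = db

# Finite-difference check of the implicit gradient.
eps = 1e-6
fd = (np.linalg.solve(A, b(theta + eps)) - x_star) / eps
print(np.round(dx, 4), np.round(fd, 4))  # the two should agree
```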
arXiv Detail & Related papers (2020-10-26T18:39:32Z)
- Moody Learners -- Explaining Competitive Behaviour of Reinforcement Learning Agents [65.2200847818153]
In a competitive scenario, the agent not only faces a dynamic environment but is also directly affected by the opponents' actions.
Observing the agent's Q-values is a common way of explaining its behaviour; however, Q-values alone do not show the temporal relation between the selected actions.
arXiv Detail & Related papers (2020-07-30T11:30:42Z)
- Learning to Incentivize Other Learning Agents [73.03133692589532]
We show how to equip RL agents with the ability to give rewards directly to other agents, using a learned incentive function.
Such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games.
Our work points toward more opportunities and challenges along the path to ensure the common good in a multi-agent future.
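A schematic of the reward-gifting mechanism (the gift amounts here are fixed by hand; in the paper they come from a learned incentive function): each agent's effective reward is its environment reward, plus incentives received, minus the cost of incentives it gives away.

```python
env_reward = {"agent_1": 1.0, "agent_2": 0.2}

# gift[i][j]: reward that agent i sends to agent j.
gift = {
    "agent_1": {"agent_2": 0.5},
    "agent_2": {},
}
cost_per_unit = 1.0  # the giver pays for the reward it hands out

def effective_reward(agent):
    received = sum(g.get(agent, 0.0) for g in gift.values())
    given = sum(gift[agent].values())
    return env_reward[agent] + received - cost_per_unit * given

print({a: round(effective_reward(a), 2) for a in env_reward})
# agent_1 sacrifices 0.5 of its own return to shape agent_2's learning,
# the kind of incentive that can steer general-sum games toward
# cooperative outcomes.
```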
arXiv Detail & Related papers (2020-06-10T20:12:38Z)