The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
- URL: http://arxiv.org/abs/2510.26752v1
- Date: Thu, 30 Oct 2025 17:46:49 GMT
- Title: The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
- Authors: William Overman, Mohsen Bayati,
- Abstract summary: We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask)<n>If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown.<n>Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee.
- Score: 9.553819152637493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human's. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment's reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
Related papers
- Can an Individual Manipulate the Collective Decisions of Multi-Agents? [53.01767232004823]
M-Spoiler is a framework that simulates agent interactions within a multi-agent system to generate adversarial samples.<n>M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples.<n>Our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems.
arXiv Detail & Related papers (2025-09-20T01:54:20Z) - The Limits of Predicting Agents from Behaviour [16.80911584745046]
We provide a precise answer under the assumption that the agent's behaviour is guided by a world model.<n>Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments.<n>We discuss the implications of these results for several research areas including fairness and safety.
arXiv Detail & Related papers (2025-06-03T14:24:58Z) - Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents [61.132523071109354]
This paper investigates the interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios.<n>Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" stances than pure game-theoretic agents.
arXiv Detail & Related papers (2025-04-11T15:41:21Z) - "Trust me on this" Explaining Agent Behavior to a Human Terminator [7.7559527224629266]
We propose an explainability scheme to help optimize the number of human interventions.<n>In this paper, we formalize this setup and propose an explainability scheme to help optimize the number of human interventions.
arXiv Detail & Related papers (2025-04-06T19:29:45Z) - On Corrigibility and Alignment in Multi Agent Games [0.0]
Corrigibility of autonomous agents is an under explored part of system design.<n>It has been suggested that uncertainty over the human preferences acts to keep the agents corrigible, even in the face of human irrationality.<n>We present a general framework for modelling corrigibility in a multi-agent setting as a 2 player game in which the agents always have a move in which they can ask the human for supervision.
arXiv Detail & Related papers (2025-01-09T16:44:38Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.<n>We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.<n>We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z) - On Assessing The Safety of Reinforcement Learning algorithms Using
Formal Methods [6.2822673562306655]
Safety mechanisms such as adversarial training, adversarial detection, and robust learning are not always adapted to all disturbances in which the agent is deployed.
It is therefore necessary to propose new solutions adapted to the learning challenges faced by the agent.
We use reward shaping and a modified Q-learning algorithm as defense mechanisms to improve the agent's policy when facing adversarial perturbations.
arXiv Detail & Related papers (2021-11-08T23:08:34Z) - Safe Reinforcement Learning via Curriculum Induction [94.67835258431202]
In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly.
Existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations.
This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor.
arXiv Detail & Related papers (2020-06-22T10:48:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.