The Partially Observable Off-Switch Game

URL: http://arxiv.org/abs/2411.17749v2
Date: Mon, 09 Dec 2024 07:49:53 GMT
Title: The Partially Observable Off-Switch Game
Authors: Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons
Abstract summary: A wide variety of goals could cause an AI to disable its off switch because "you can't fetch the coffee if you're dead". We introduce the Partially Observable Off-Switch Game (PO-OSG), a game-theoretic model of the shutdown problem with asymmetric information. We find that in optimal play, even AI agents assisting perfectly rational humans sometimes avoid shutdown.
Score: 7.567880819525154
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A wide variety of goals could cause an AI to disable its off switch because "you can't fetch the coffee if you're dead" (Russell 2019). Prior theoretical work on this shutdown problem assumes that humans know everything that AIs do. In practice, however, humans have only limited information. Moreover, in many of the settings where the shutdown problem is most concerning, AIs might have vast amounts of private information. To capture these differences in knowledge, we introduce the Partially Observable Off-Switch Game (PO-OSG), a game-theoretic model of the shutdown problem with asymmetric information. Unlike when the human has full observability, we find that in optimal play, even AI agents assisting perfectly rational humans sometimes avoid shutdown. As expected, increasing the amount of communication or information available always increases (or leaves unchanged) the agents' expected common payoff. But counterintuitively, introducing bounded communication can make the AI defer to the human less in optimal play even though communication mitigates information asymmetry. In particular, communication sometimes enables new optimal behavior requiring strategic AI deference to achieve outcomes that were previously inaccessible. Thus, designing safe artificial agents in the presence of asymmetric information requires careful consideration of the tradeoffs between maximizing payoffs (potentially myopically) and maintaining AIs' incentives to defer to humans.
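The contrast the abstract draws between full and partial observability can be illustrated with a toy calculation. The sketch below is not the paper's PO-OSG model; it is a minimal one-shot example under assumed conditions: the AI's proposed action has an unknown utility `u`, shutting down yields 0, and the payoff is common to both players. The function names and the example distribution are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def expected(dist, f):
    """Expected value of f(u) under a finite distribution {u: probability}."""
    return sum(p * f(u) for u, p in dist.items())


def always_act(dist):
    # The AI bypasses the human and acts regardless of u.
    return expected(dist, lambda u: u)


def defer_to_informed_human(dist):
    # Full observability (assumed): the human sees u and allows the action
    # only when u > 0, otherwise switches the AI off for a payoff of 0.
    return expected(dist, lambda u: max(u, 0))


def always_defer_to_uninformed_human(dist):
    # Toy partial observability (assumed): the human sees nothing beyond the
    # prior, and an AI that always defers reveals nothing, so the human
    # allows the action only when the prior mean is positive.
    return max(expected(dist, lambda u: u), 0)


def ai_uses_private_information(dist):
    # The AI alone observes u, acts when u > 0, and shuts itself down otherwise.
    return expected(dist, lambda u: max(u, 0))


# Hypothetical prior: the action is worth +2 or -1 with equal probability.
prior = {2: Fraction(1, 2), -1: Fraction(1, 2)}

print("always act:", always_act(prior))
print("defer, informed human:", defer_to_informed_human(prior))
print("always defer, uninformed human:", always_defer_to_uninformed_human(prior))
print("AI uses private information:", ai_uses_private_information(prior))
```

In this toy setting, deferring to a fully informed human is never worse than acting unilaterally, whereas always deferring to a human who lacks the AI's information can forfeit expected payoff. The paper's actual analysis covers richer observation structures and communication, which this sketch does not attempt to model.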

Related papers

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games [63.29377274531968]
We introduce the AI GameStore, a scalable and open-ended platform to synthesize new representative human games. We generate 100 such games based on the top charts of Apple App Store and Steam, and evaluate seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning.
arXiv Detail & Related papers (2026-02-19T18:17:25Z)
Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities [42.55442413239192]
We study how to ensure that AI systems are aligned with human values and remain safe. The AI assistance problem concerns designing an AI agent that helps a human to maximise their utility function(s). The shutdown problem instead concerns designing AI agents that: shut down when a shutdown button is pressed; neither try to prevent nor cause the pressing of the shutdown button; and otherwise accomplish their task.
arXiv Detail & Related papers (2025-12-29T14:47:05Z)
The AI off-switch problem as a signalling game: bounded rationality and incomparability [45.76759085727843]
We model the off-switch problem as a signalling game, where a human decision-maker communicates its preferences to an AI agent. We show that a necessary condition for an AI system to refrain from disabling its off-switch is its uncertainty about the human's utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.
arXiv Detail & Related papers (2025-02-10T12:44:49Z)
On the consistent reasoning paradox of intelligence and optimal trust in AI: The power of 'I don't know' [79.69412622010249]
Consistent reasoning, which lies at the core of human intelligence, is the ability to handle tasks that are equivalent. The consistent reasoning paradox (CRP) asserts that consistent reasoning implies fallibility -- in particular, human-like intelligence in AI necessarily comes with human-like fallibility.
arXiv Detail & Related papers (2024-08-05T10:06:53Z)
What Else Do I Need to Know? The Effect of Background Information on Users' Reliance on QA Systems [23.69129423040988]
We study how users interact with QA systems in the absence of sufficient information to assess their predictions. Our study reveals that users rely on model predictions even in the absence of sufficient information needed to assess the model's correctness.
arXiv Detail & Related papers (2023-05-23T17:57:12Z)
Alterfactual Explanations -- The Relevance of Irrelevance for Explaining AI Systems [0.9542023122304099]
We argue that in order to fully understand a decision, not only knowledge about relevant features is needed, but that the awareness of irrelevant information also highly contributes to the creation of a user's mental model of an AI system. Our approach, which we call Alterfactual Explanations, is based on showing an alternative reality where irrelevant features of an AI's input are altered. We show that alterfactual explanations are suited to convey an understanding of different aspects of the AI's reasoning than established counterfactual explanation methods.
arXiv Detail & Related papers (2022-07-19T16:20:37Z)
On Avoiding Power-Seeking by Artificial Intelligence [93.9264437334683]
We do not know how to align a very intelligent AI agent's behavior with human interests. I investigate whether we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power.
arXiv Detail & Related papers (2022-06-23T16:56:21Z)
On the Effect of Information Asymmetry in Human-AI Teams [0.0]
We focus on the existence of complementarity potential between humans and AI. Specifically, we identify information asymmetry as an essential source of complementarity potential. By conducting an online experiment, we demonstrate that humans can use such contextual information to adjust the AI's decision.
arXiv Detail & Related papers (2022-05-03T13:02:50Z)
On the Influence of Explainable AI on Automation Bias [0.0]
We aim to shed light on the potential to influence automation bias by explainable AI (XAI). We conduct an online experiment with regard to hotel review classifications and discuss first results.
arXiv Detail & Related papers (2022-04-19T12:54:23Z)
Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations. It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z)
The Threat of Offensive AI to Organizations [52.011307264694665]
This survey explores the threat of offensive AI on organizations. First, we discuss how AI changes the adversary's methods, strategies, goals, and overall attack model. Then, through a literature review, we identify 33 offensive AI capabilities which adversaries can use to enhance their attacks.
arXiv Detail & Related papers (2021-06-30T01:03:28Z)
Does Explainable Artificial Intelligence Improve Human Decision-Making? [17.18994675838646]
We compare and evaluate objective human decision accuracy without AI (control), with an AI prediction (no explanation) and AI prediction with explanation. We find any kind of AI prediction tends to improve user decision accuracy, but no conclusive evidence that explainable AI has a meaningful impact. Our results indicate that, at least in some situations, the "why" information provided in explainable AI may not enhance user decision-making.
arXiv Detail & Related papers (2020-06-19T15:46:13Z)
Aligning Superhuman AI with Human Behavior: Chess as a Model System [5.236087378443016]
We develop Maia, a customized version of Alpha-Zero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines. For a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines.
arXiv Detail & Related papers (2020-06-02T18:12:52Z)
Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork [54.309495231017344]
We argue that AI systems should be trained in a human-centered manner, directly optimized for team performance. We study this proposal for a specific type of human-AI teaming, where the human overseer chooses to either accept the AI recommendation or solve the task themselves. Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the most accurate AI may not lead to the highest team performance.
arXiv Detail & Related papers (2020-04-27T19:06:28Z)
Signaling in Bayesian Network Congestion Games: the Subtle Power of Symmetry [66.82463322411614]
The paper focuses on the problem of optimal ex ante persuasive signaling schemes, showing that symmetry is a crucial property for its solution. We show that an optimal ex ante persuasive scheme can be computed in polynomial time when players are symmetric and have affine cost functions.
arXiv Detail & Related papers (2020-02-12T19:38:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.