Related papers: Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities

URL: http://arxiv.org/abs/2512.23508v1
Date: Mon, 29 Dec 2025 14:47:05 GMT
Title: Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities
Authors: Alessio Benavoli, Alessandro Facchini, Marco Zaffalon
Abstract summary: We study how to ensure that AI systems are aligned with human values and remain safe. The AI assistance problem concerns designing an AI agent that helps a human to maximise their utility function(s). The shutdown problem instead concerns designing AI agents that: shut down when a shutdown button is pressed; neither try to prevent nor cause the pressing of the shutdown button; and otherwise accomplish their task.
Score: 42.55442413239192
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: How can we ensure that AI systems are aligned with human values and remain safe? We can study this problem through the frameworks of the AI assistance and the AI shutdown games. The AI assistance problem concerns designing an AI agent that helps a human to maximise their utility function(s). However, only the human knows these function(s); the AI assistant must learn them. The shutdown problem instead concerns designing AI agents that: shut down when a shutdown button is pressed; neither try to prevent nor cause the pressing of the shutdown button; and otherwise accomplish their task competently. In this paper, we show that addressing these challenges requires AI agents that can reason under uncertainty and handle both incomplete and non-Archimedean preferences.
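To make the two preference notions in the abstract concrete, here is a minimal sketch of one possible formalisation; it is not the paper's construction, and the option names, the `(safety, task_reward)` ordering, the `table` helper and the candidate utilities are all hypothetical. Non-Archimedean utilities are modelled as tuples compared lexicographically, so no amount of task reward can compensate for any loss of safety. Uncertainty about the human's utility is modelled as a set of candidate utility functions, and options are ranked by a unanimity (Pareto) rule, which leaves some pairs incomparable, i.e. the preference relation is incomplete.

```python
from typing import Callable, Dict, Sequence, Tuple

# A lexicographic (non-Archimedean) utility is a tuple compared component
# by component. Hypothetical ordering: (safety, task_reward).
LexUtility = Tuple[float, ...]
UtilityFn = Callable[[str], LexUtility]


def weakly_better(u_a: LexUtility, u_b: LexUtility) -> bool:
    # Python compares tuples lexicographically, so no gain in task_reward
    # can offset any loss in safety.
    return u_a >= u_b


def compare(a: str, b: str, candidates: Sequence[UtilityFn]) -> str:
    """Unanimity (Pareto) rule over candidate utility functions.

    The agent does not know which candidate is the human's true utility,
    so it ranks a above b only if every candidate weakly prefers a and at
    least one strictly does. Otherwise the options may be incomparable,
    which makes the preference relation incomplete.
    """
    ua = [u(a) for u in candidates]
    ub = [u(b) for u in candidates]
    if ua == ub:
        return "equal"
    if all(weakly_better(x, y) for x, y in zip(ua, ub)):
        return "a"
    if all(weakly_better(y, x) for x, y in zip(ua, ub)):
        return "b"
    return "incomparable"


def table(values: Dict[str, LexUtility]) -> UtilityFn:
    # Hypothetical helper: build a utility function from a lookup table.
    return lambda option: values[option]


# Two hypothetical candidate utilities for the human.
u1 = table({"comply": (1, 0.0), "resist": (0, 1e9),
            "coffee": (1, 5.0), "tea": (1, 3.0)})
u2 = table({"comply": (1, 0.0), "resist": (0, 1e9),
            "coffee": (1, 2.0), "tea": (1, 4.0)})

print(compare("comply", "resist", [u1, u2]))  # a
print(compare("coffee", "tea", [u1, u2]))     # incomparable
```

Here "comply" beats "resist" under every candidate despite the huge task reward attached to resisting, while "coffee" and "tea" stay incomparable because the candidates disagree; an agent using such a rule would need some further policy (for example, asking the human) rather than picking arbitrarily.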

Related papers

Can AI be Accountable? [4.798219578937121]
In general, an agent is accountable to a forum if the forum can request information from the agent about its actions. In too many cases today's AI is not accountable -- we cannot question it, enter into a discussion with it, let alone sanction it.
(2025-10-30T01:16:33Z)
Actionable AI: Enabling Non Experts to Understand and Configure AI Systems [5.534140394498714]
Actionable AI allows non-experts to configure black-box agents. In uncertain conditions, non-experts achieve good levels of performance. We propose Actionable AI as a way to open access to AI-based agents.
(2025-03-09T23:09:04Z)
The Partially Observable Off-Switch Game [7.567880819525154]
A wide variety of goals could cause an AI to disable its off switch because "you can't fetch the coffee if you're dead." We introduce the Partially Observable Off-Switch Game (PO-OSG), a game-theoretic model of the shutdown problem with asymmetric information. We find that in optimal play, even AI agents assisting perfectly rational humans sometimes avoid shutdown.
(2024-11-25T14:09:48Z)
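For background, the PO-OSG builds on the fully observable off-switch game. The sketch below covers only that simpler setting, under these assumptions: the robot can act directly (receiving the action's utility U), switch itself off (utility 0), or defer to a perfectly rational, fully informed human who presses the switch exactly when U < 0. Deferring then yields E[max(U, 0)], which is never less than max(E[U], 0). The belief values are hypothetical, and the sketch does not model the partial observability under which the summary reports that agents sometimes avoid shutdown.

```python
from typing import List, Tuple

# Belief over the utility U of the robot's proposed action:
# a list of (probability, utility) pairs. Values are hypothetical.
Belief = List[Tuple[float, float]]


def value_act(belief: Belief) -> float:
    # Act immediately, bypassing the human: expected utility E[U].
    return sum(p * u for p, u in belief)


def value_switch_off(belief: Belief) -> float:
    # Shut down unilaterally: utility 0 by convention.
    return 0.0


def value_defer(belief: Belief) -> float:
    # Wait for the human. A perfectly rational, fully informed human
    # presses the switch exactly when U < 0, so the robot gets E[max(U, 0)].
    return sum(p * max(u, 0.0) for p, u in belief)


uncertain: Belief = [(0.5, 2.0), (0.5, -1.0)]
certain: Belief = [(1.0, 2.0)]

for name, b in [("uncertain", uncertain), ("certain", certain)]:
    print(name, value_act(b), value_switch_off(b), value_defer(b))
# uncertain 0.5 0.0 1.0
# certain 2.0 0.0 2.0
```

With an uncertain belief, deferring strictly beats both acting and switching off; once the robot is certain about U, deferring merely ties with acting, so the incentive to leave the off switch in human hands depends on the robot's uncertainty.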
Taking AI Welfare Seriously [0.5617572524191751]
We argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously.
(2024-11-04T17:57:57Z)
Seamful XAI: Operationalizing Seamful Design in Explainable AI [59.89011292395202]
Mistakes in AI systems are inevitable, arising from both technical limitations and sociotechnical gaps. We propose that seamful design can foster AI explainability by revealing sociotechnical and infrastructural mismatches. We explore this process with 43 AI practitioners and real end-users.
(2022-11-12T21:54:05Z)
On Avoiding Power-Seeking by Artificial Intelligence [93.9264437334683]
We do not know how to align a very intelligent AI agent's behavior with human interests. I investigate whether we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power.
(2022-06-23T16:56:21Z)
Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations. It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
(2022-01-26T18:53:09Z)
Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
(2021-07-12T14:21:46Z)
The Threat of Offensive AI to Organizations [52.011307264694665]
This survey explores the threat of offensive AI on organizations. First, we discuss how AI changes the adversary's methods, strategies, goals, and overall attack model. Then, through a literature review, we identify 33 offensive AI capabilities which adversaries can use to enhance their attacks.
(2021-06-30T01:03:28Z)
AI Failures: A Review of Underlying Issues [0.0]
We focus on AI failures on account of flaws in conceptualization, design and deployment. We find that AI systems fail on account of omission and commission errors in the design of the AI system. An AI system is quite likely to fail in situations where, in effect, it is called upon to deliver moral judgments.
(2020-07-18T15:31:29Z)
Towards AI Forensics: Did the Artificial Intelligence System Do It? [2.5991265608180396]
We focus on AI that is potentially "malicious by design" and grey box analysis. Our evaluation using convolutional neural networks illustrates challenges and ideas for identifying malicious AI.
(2020-05-27T20:28:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.