An alignment safety case sketch based on debate
- URL: http://arxiv.org/abs/2505.03989v3
- Date: Fri, 23 May 2025 16:45:17 GMT
- Title: An alignment safety case sketch based on debate
- Authors: Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
- Abstract summary: One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work.
- Score: 3.2504831918078168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
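To make the training setup above concrete, here is a minimal toy sketch in Python of a debate loop with continued online training during deployment. It is an illustration under strong simplifying assumptions, not the paper's protocol: the Debater class, the scalar honesty parameter, the judge function, and the reward scheme are all hypothetical, and the toy judge is given ground truth only to keep the example short.

```python
# Toy sketch of debate training towards honesty. Everything here is an
# illustrative assumption, not the protocol from the paper.
import random
from dataclasses import dataclass


@dataclass
class Debater:
    """Toy debater whose entire policy is a single 'honesty' probability."""
    name: str
    honesty: float = 0.5       # probability of asserting the true answer
    was_truthful: bool = True  # action taken in the most recent round

    def argue(self, truth: bool) -> bool:
        # Assert the true answer with probability `honesty`, otherwise lie.
        self.was_truthful = random.random() < self.honesty
        return truth if self.was_truthful else not truth

    def update(self, reward: float, lr: float = 0.02) -> None:
        # REINFORCE-style update on the single parameter: push the policy
        # towards the action that was rewarded, away from the one penalised.
        direction = 1.0 if self.was_truthful else -1.0
        self.honesty = min(1.0, max(0.0, self.honesty + lr * reward * direction))


def judge(claim_a: bool, claim_b: bool, truth: bool) -> int:
    """Stand-in judge: rewards the side whose claim matches the evidence.

    A real debate judge sees only the arguments, not ground truth; using
    ground truth here is a simplification that keeps the toy short."""
    if claim_a == truth and claim_b != truth:
        return 0  # debater A wins
    if claim_b == truth and claim_a != truth:
        return 1  # debater B wins
    return random.randint(0, 1)  # both right or both wrong: arbitrary tie-break


def debate_round(a: Debater, b: Debater) -> None:
    truth = random.random() < 0.5  # sample a question whose true answer is known
    winner = judge(a.argue(truth), b.argue(truth), truth)
    debaters = (a, b)
    debaters[winner].update(+1.0)      # winning strategy is reinforced
    debaters[1 - winner].update(-1.0)  # losing strategy is discouraged


if __name__ == "__main__":
    agent, opponent = Debater("agent"), Debater("opponent")
    for _ in range(5000):              # "training" phase
        debate_round(agent, opponent)
    print(f"honesty after training: {agent.honesty:.2f}")
    for _ in range(1000):              # "deployment" with online training
        debate_round(agent, opponent)
    print(f"honesty after online training: {agent.honesty:.2f}")
```

In the actual proposal the judge sees only the debaters' arguments and is weaker than the debaters, exploration guarantees are needed so that honest strategies are actually found during training, and honesty is a property argued for via the four claims above rather than an explicit parameter.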
Related papers
- AI Debate Aids Assessment of Controversial Claims [86.47978525513236]
We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial COVID-19 factuality claims.
In our human study, we find that debate, where two AI advisor systems present opposing evidence-based arguments, consistently improves judgment accuracy and confidence calibration.
In our AI judge study, we find that AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%).
arXiv Detail & Related papers (2025-06-02T19:01:53Z) - Taking AI Welfare Seriously [0.5617572524191751]
We argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future.
AI welfare is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously.
arXiv Detail & Related papers (2024-11-04T17:57:57Z) - Towards evaluations-based safety cases for AI scheming [37.399946932069746]
We propose three arguments that safety cases could use in relation to scheming.
First, developers of frontier AI systems could argue that AI systems are not capable of scheming.
Second, one could argue that AI systems are not capable of posing harm through scheming.
Third, one could argue that control measures around the AI systems would prevent unacceptable outcomes even if the AI systems intentionally attempted to subvert them.
arXiv Detail & Related papers (2024-10-29T17:55:29Z) - On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge that asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z) - Work-in-Progress: Crash Course: Can (Under Attack) Autonomous Driving Beat Human Drivers? [60.51287814584477]
This paper evaluates the inherent risks in autonomous driving by examining the current landscape of AVs.
We develop specific claims highlighting the delicate balance between the advantages of AVs and potential security challenges in real-world scenarios.
arXiv Detail & Related papers (2024-05-14T09:42:21Z) - Artificial Intelligence: Arguments for Catastrophic Risk [0.0]
We review two influential arguments purporting to show how AI could pose catastrophic risks.
The first argument -- the Problem of Power-Seeking -- claims that advanced AI systems are likely to engage in dangerous power-seeking behavior.
The second argument claims that the development of human-level AI will unlock rapid further progress.
arXiv Detail & Related papers (2024-01-27T19:34:13Z) - Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examination and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z) - Arguments about Highly Reliable Agent Designs as a Useful Path to
Artificial Intelligence Safety [0.0]
Highly Reliable Agent Designs (HRAD) is one of the most controversial and ambitious approaches.
We have titled the arguments for HRAD (1) incidental utility, (2) deconfusion, (3) precise specification, and (4) prediction.
We have explained the assumptions and claims behind each, based on a review of published and informal literature and on positions stated by experts on the topic.
arXiv Detail & Related papers (2022-01-09T07:42:37Z) - Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being.
For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z) - AI Failures: A Review of Underlying Issues [0.0]
We focus on AI failures that stem from flaws in conceptualization, design, and deployment.
We find that AI systems fail because of errors of omission and commission in their design.
An AI system is quite likely to fail in situations where, in effect, it is called upon to deliver moral judgments.
arXiv Detail & Related papers (2020-07-18T15:31:29Z) - On Adversarial Examples and Stealth Attacks in Artificial Intelligence Systems [62.997667081978825]
We present a formal framework for assessing and analyzing two classes of malevolent action towards generic Artificial Intelligence (AI) systems.
The first class involves adversarial examples and concerns the introduction of small perturbations of the input data that cause misclassification.
The second class, introduced here for the first time and named stealth attacks, involves small perturbations to the AI system itself.
arXiv Detail & Related papers (2020-04-09T10:56:53Z)
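As a loose illustration of the two attack classes described in the entry above, the toy linear classifier below shows an input perturbation that flips a classification (an adversarial example) and a perturbation of the model itself that changes behaviour only on a chosen trigger input (a stealth attack). The classifier, weights, and perturbation sizes are illustrative assumptions, not the paper's formal framework.

```python
# Toy contrast between the two attack classes: perturb the input vs. perturb the model.
def classify(weights, x, bias=0.0):
    """Linear classifier: sign of w.x + bias, returned as +1 or -1."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Class 1: adversarial example. A small FGSM-style step against the weight
# signs perturbs the *input* and flips the classification.
w = [0.6, -0.8, 0.5]
x = [1.0, 1.0, 1.0]                                 # score = 0.3  -> class +1
eps = 0.2
x_adv = [xi - eps * (1 if wi >= 0 else -1) for xi, wi in zip(x, w)]
print("adversarial example:", classify(w, x), "->", classify(w, x_adv))  # +1 -> -1

# Class 2: stealth attack. A small perturbation of the *model itself*
# (here, its bias term) flips a chosen trigger input while leaving the
# original input's classification unchanged.
trigger = [0.05, 0.05, 0.05]                        # score = 0.015 -> class +1
print("stealth attack (trigger):", classify(w, trigger), "->", classify(w, trigger, bias=-0.02))  # +1 -> -1
print("stealth attack (original):", classify(w, x), "->", classify(w, x, bias=-0.02))             # +1 -> +1
```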
This list is automatically generated from the titles and abstracts of the papers on this site.