Assessing confidence in frontier AI safety cases
- URL: http://arxiv.org/abs/2502.05791v1
- Date: Sun, 09 Feb 2025 06:35:11 GMT
- Title: Assessing confidence in frontier AI safety cases
- Authors: Stephen Barrett, Philip Fox, Joshua Krook, Tuneer Mondal, Simon Mylius, Alejandro Tlaie,
- Abstract summary: A safety case presents a structured argument in support of a top-level claim about a safety property of the system.
This raises the question of what level of confidence should be associated with a top-level claim.
We propose a method by which AI developers can prioritise, and thereby make their investigation of argument defeaters more efficient.
- Score: 37.839615078345886
- Abstract: Powerful new frontier AI technologies are bringing many benefits to society but at the same time bring new risks. AI developers and regulators are therefore seeking ways to assure the safety of such systems, and one promising method under consideration is the use of safety cases. A safety case presents a structured argument in support of a top-level claim about a safety property of the system. Such top-level claims are often presented as a binary statement, for example "Deploying the AI system does not pose unacceptable risk". However, in practice, it is often not possible to make such statements unequivocally. This raises the question of what level of confidence should be associated with a top-level claim. We adopt the Assurance 2.0 safety assurance methodology, and we ground our work by specific application of this methodology to a frontier AI inability argument that addresses the harm of cyber misuse. We find that numerical quantification of confidence is challenging, though the processes associated with generating such estimates can lead to improvements in the safety case. We introduce a method for better enabling reproducibility and transparency in probabilistic assessment of confidence in argument leaf nodes through a purely LLM-implemented Delphi method. We propose a method by which AI developers can prioritise, and thereby make their investigation of argument defeaters more efficient. Proposals are also made on how best to communicate confidence information to executive decision-makers.
Related papers
- Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications.
Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness.
Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
- OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
- Safety case template for frontier AI: A cyber inability argument [2.2628353000034065]
We propose a safety case template for offensive cyber capabilities.
We identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results.
arXiv Detail & Related papers (2024-11-12T18:45:08Z)
- SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.
It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters.
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
- Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context.
We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z)
- Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI.
The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees.
We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them.
arXiv Detail & Related papers (2024-05-10T17:38:32Z)
- Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL).
We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z)
- Integrating Testing and Operation-related Quantitative Evidences in Assurance Cases to Argue Safety of Data-Driven AI/ML Components [2.064612766965483]
In the future, AI will increasingly find its way into systems that can potentially cause physical harm to humans.
For such safety-critical systems, it must be demonstrated that their residual risk does not exceed what is acceptable.
This paper proposes a more holistic argumentation structure for having achieved the target.
arXiv Detail & Related papers (2022-02-10T20:35:25Z)