Related papers: Safety case template for frontier AI: A cyber inability argument

URL: http://arxiv.org/abs/2411.08088v1
Date: Tue, 12 Nov 2024 18:45:08 GMT
Title: Safety case template for frontier AI: A cyber inability argument
Authors: Arthur Goemans, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, Geoffrey Irving
Abstract summary: We propose a safety case template for offensive cyber capabilities. We identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results.
Score: 2.2628353000034065
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier artificial intelligence (AI) systems pose increasing risks to society, making it essential for developers to provide assurances about their safety. One approach to offering such assurances is through a safety case: a structured, evidence-based argument aimed at demonstrating why the risk associated with a safety-critical system is acceptable. In this article, we propose a safety case template for offensive cyber capabilities. We illustrate how developers could argue that a model does not have capabilities posing unacceptable cyber risks by breaking down the main claim into progressively specific sub-claims, each supported by evidence. In our template, we identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results. Elements of current frontier safety techniques - such as risk models, proxy tasks, and capability evaluations - use implicit arguments for overall system safety. This safety case template integrates these elements using the Claims Arguments Evidence (CAE) framework in order to make safety arguments coherent and explicit. While uncertainties around the specifics remain, this template serves as a proof of concept, aiming to foster discussion on AI safety cases and advance AI assurance.

Related papers

The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z)
Assessing confidence in frontier AI safety cases [37.839615078345886]
A safety case presents a structured argument in support of a top-level claim about a safety property of the system. This raises the question of what level of confidence should be associated with a top-level claim. We propose a method by which AI developers can prioritise argument defeaters and thereby investigate them more efficiently.
arXiv Detail & Related papers (2025-02-09T06:35:11Z)
Dynamic safety cases for frontier AI [0.7538606213726908]
This paper proposes a Dynamic Safety Case Management System (DSCMS) to support both the initial creation of a safety case and its systematic, semi-automated revision over time. We demonstrate this approach on a safety case template for offensive cyber capabilities and suggest ways it can be integrated into governance structures for safety-critical decision-making.
arXiv Detail & Related papers (2024-12-23T14:43:41Z)
OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
Position: A taxonomy for reporting and describing AI security incidents [57.98317583163334]
We argue that specific taxonomies are required to describe and report security incidents of AI systems. Existing frameworks for either non-AI security or generic AI safety incident reporting are insufficient to capture the specific properties of AI security.
arXiv Detail & Related papers (2024-12-19T13:50:26Z)
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation [0.3413711585591077]
As generative AI systems, including large language models (LLMs) and diffusion models, advance rapidly, their growing adoption has led to new and complex security risks. This paper introduces a novel formal framework for categorizing and mitigating these emergent security risks. We identify previously under-explored risks, including latent space exploitation, multi-modal cross-attack vectors, and feedback-loop-induced model degradation.
arXiv Detail & Related papers (2024-10-15T02:51:32Z)
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context. We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
Cross-Modality Safety Alignment [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We introduce and define a family of approaches to AI safety, which we refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. We outline a number of approaches for creating each of the three core components (a world model, a safety specification, and a verifier), describe the main technical challenges, and suggest a number of potential solutions to them.
arXiv Detail & Related papers (2024-05-10T17:38:32Z)
Affirmative safety: An approach to risk management for high-risk AI [6.133009503054252]
We argue that entities developing or deploying high-risk AI systems should be required to present evidence of affirmative safety. We propose a risk management approach for advanced AI in which model developers must provide evidence that their activities keep certain risks below regulator-set thresholds.
arXiv Detail & Related papers (2024-04-14T20:48:55Z)
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models. It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content. Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models. We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models. We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)
Integrating Testing and Operation-related Quantitative Evidences in Assurance Cases to Argue Safety of Data-Driven AI/ML Components [2.064612766965483]
In the future, AI will increasingly find its way into systems that can potentially cause physical harm to humans. For such safety-critical systems, it must be demonstrated that their residual risk does not exceed what is acceptable. This paper proposes a more holistic argumentation structure for having achieved the target.
arXiv Detail & Related papers (2022-02-10T20:35:25Z)
The missing link: Developing a safety case for perception components in automated driving [10.43163823170716]
Perception is a key aspect of automated driving systems (AD) that relies heavily on Machine Learning (ML). Despite the known challenges with the safety assurance of ML-based components, proposals have recently emerged for unit-level safety cases addressing these components. We propose a generic template for such a linking argument specifically tailored for perception components.
arXiv Detail & Related papers (2021-08-30T15:12:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.