Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations
- URL: http://arxiv.org/abs/2412.15433v1
- Date: Thu, 19 Dec 2024 22:31:34 GMT
- Title: Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations
- Authors: Paolo Bova, Alessandro Di Stefano, The Anh Han,
- Abstract summary: We present a quantitative model for tracking dangerous AI capabilities over time.
Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks.
- Score: 47.698233647783965
- License:
- Abstract: We present a quantitative model for tracking dangerous AI capabilities over time. Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks. We first use the model to provide a novel introduction to dangerous capability testing and how this testing can directly inform policy. Decision makers in AI labs and government often set policy that is sensitive to the estimated danger of AI systems, and may wish to set policies that condition on the crossing of a set threshold for danger. The model helps us to reason about these policy choices. We then run simulations to illustrate how we might fail to test for dangerous capabilities. To summarise, failures in dangerous capability testing may manifest in two ways: higher bias in our estimates of AI danger, or larger lags in threshold monitoring. We highlight two drivers of these failure modes: uncertainty around dynamics in AI capabilities and competition between frontier AI labs. Effective AI policy demands that we address these failure modes and their drivers. Even if the optimal targeting of resources is challenging, we show how delays in testing can harm AI policy. We offer preliminary recommendations for building an effective testing ecosystem for dangerous capabilities and advise on a research agenda.
Related papers
- Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI.
We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts.
We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.
Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.
However, the deployment of these agents in physical environments presents significant safety challenges.
This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - Work-in-Progress: Crash Course: Can (Under Attack) Autonomous Driving Beat Human Drivers? [60.51287814584477]
This paper evaluates the inherent risks in autonomous driving by examining the current landscape of AVs.
We develop specific claims highlighting the delicate balance between the advantages of AVs and potential security challenges in real-world scenarios.
arXiv Detail & Related papers (2024-05-14T09:42:21Z) - Control Risk for Potential Misuse of Artificial Intelligence in Science [85.91232985405554]
We aim to raise awareness of the dangers of AI misuse in science.
We highlight real-world examples of misuse in chemical science.
We propose a system called SciGuard to control misuse risks for AI models in science.
arXiv Detail & Related papers (2023-12-11T18:50:57Z) - Managing extreme AI risks amid rapid progress [171.05448842016125]
We describe risks that include large-scale social harms, malicious uses, and irreversible loss of human control over autonomous AI systems.
There is a lack of consensus about how exactly such risks arise, and how to manage them.
Present governance initiatives lack the mechanisms and institutions to prevent misuse and recklessness, and barely address autonomous systems.
arXiv Detail & Related papers (2023-10-26T17:59:06Z) - An Overview of Catastrophic AI Risks [38.84933208563934]
This paper provides an overview of the main sources of catastrophic AI risks, which we organize into four categories.
Malicious use, in which individuals or groups intentionally use AIs to cause harm; AI race, in which competitive environments compel actors to deploy unsafe AIs or cede control to AIs.
organizational risks, highlighting how human factors and complex systems can increase the chances of catastrophic accidents.
rogue AIs, describing the inherent difficulty in controlling agents far more intelligent than humans.
arXiv Detail & Related papers (2023-06-21T03:35:06Z) - Multi-Agent Vulnerability Discovery for Autonomous Driving with Hazard
Arbitration Reward [21.627246586543542]
This work proposes a Safety Test framework by finding Av-Responsible Scenarios (STARS) based on multi-agent reinforcement learning.
STARS guides other traffic participants to produce Av-Responsible Scenarios and make the under-test driving policy misbehave.
arXiv Detail & Related papers (2021-12-12T08:58:32Z) - On Safety Assessment of Artificial Intelligence [0.0]
We show that many models of artificial intelligence, in particular machine learning, are statistical models.
Part of the budget of dangerous random failures for the relevant safety integrity level needs to be used for the probabilistic faulty behavior of the AI system.
We propose a research challenge that may be decisive for the use of AI in safety related systems.
arXiv Detail & Related papers (2020-02-29T14:05:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.