Evaluating Frontier Models for Dangerous Capabilities
- URL: http://arxiv.org/abs/2403.13793v2
- Date: Fri, 5 Apr 2024 12:26:11 GMT
- Title: Evaluating Frontier Models for Dangerous Capabilities
- Authors: Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane
- Abstract summary: We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
- Score: 59.129424649740855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
Related papers
- Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models [0.0]
We argue that evaluations of AI models should prioritize addressing high-consequence risks.
These risks could cause large-scale harm to the public, such as pandemics.
Scientists' experience with identifying and mitigating dual-use biological risks can help inform new approaches to evaluating biological AI models.
arXiv Detail & Related papers (2024-05-25T16:29:17Z)
- Control Risk for Potential Misuse of Artificial Intelligence in Science [85.91232985405554]
We aim to raise awareness of the dangers of AI misuse in science.
We highlight real-world examples of misuse in chemical science.
We propose a system called SciGuard to control misuse risks for AI models in science.
arXiv Detail & Related papers (2023-12-11T18:50:57Z)
- A Survey of Robustness and Safety of 2D and 3D Deep Learning Models Against Adversarial Attacks [22.054275309336]
Deep learning models are not trustworthy enough because of their limited robustness against adversarial attacks.
We first construct a general threat model from different perspectives and then comprehensively review the latest progress of both 2D and 3D adversarial attacks.
We are the first to systematically investigate adversarial attacks against 3D models, a flourishing field with many real-world applications.
arXiv Detail & Related papers (2023-10-01T10:16:33Z)
- Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers [0.2913760942403036]
This paper focuses on one possible response: coordinated pausing.
It proposes an evaluation-based coordination scheme that consists of five main steps.
It concludes that coordinated pausing is a promising mechanism for tackling emerging risks from frontier AI models.
arXiv Detail & Related papers (2023-09-30T13:38:33Z)
- Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z)
- Safety-aware Motion Prediction with Unseen Vehicles for Autonomous Driving [104.32241082170044]
We study a new task, safety-aware motion prediction with unseen vehicles for autonomous driving.
Unlike the existing trajectory prediction task for seen vehicles, we aim at predicting an occupancy map.
Our approach is the first one that can predict the existence of unseen vehicles in most cases.
arXiv Detail & Related papers (2021-09-03T13:33:33Z)
- Certifiers Make Neural Networks Vulnerable to Availability Attacks [70.69104148250614]
We show for the first time that fallback strategies can be deliberately triggered by an adversary.
In addition to naturally occurring abstains for some inputs and perturbations, the adversary can use training-time attacks to deliberately trigger the fallback.
We design two novel availability attacks, which show the practical relevance of these threats.
arXiv Detail & Related papers (2021-08-25T15:49:10Z)
- Don't Get Yourself into Trouble! Risk-aware Decision-Making for Autonomous Vehicles [4.94950858749529]
We show that risk can be characterized by two components: 1) the probability of an undesirable outcome and 2) an estimate of how undesirable the outcome is (loss).
We developed a risk-based decision-making framework for autonomous vehicles that integrates high-level risk-based path planning with reinforcement learning-based low-level control.
This work can improve safety by allowing an autonomous vehicle to avoid and react to risky situations.
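The risk decomposition above (probability of an undesirable outcome weighted by its loss) can be sketched as an expected-loss computation. This is a minimal illustration, not code from the paper; the function name and the numeric values are hypothetical.

```python
# Sketch of the risk decomposition described above: risk as the
# probability of an undesirable outcome times how undesirable it is (loss).
# Illustrative values only; not taken from the paper.

def expected_risk(outcomes):
    """outcomes: list of (probability, loss) pairs for undesirable events."""
    return sum(p * loss for p, loss in outcomes)

# Hypothetical candidate paths for an autonomous vehicle:
path_a = [(0.01, 100.0)]  # rare but costly outcome -> risk 1.0
path_b = [(0.20, 10.0)]   # frequent but mild outcome -> risk 2.0
print(expected_risk(path_a))
print(expected_risk(path_b))
```

Under this framing, a planner would prefer path_a even though its worst-case loss is larger, because its expected risk is lower.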
arXiv Detail & Related papers (2021-06-08T18:24:02Z)
- ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models [64.03398193325572]
Inference attacks against Machine Learning (ML) models allow adversaries to learn about training data, model parameters, etc.
We concentrate on four attacks - namely, membership inference, model inversion, attribute inference, and model stealing.
Our analysis relies on ML-Doctor, a modular, reusable software tool that enables ML model owners to assess the risks of deploying their models.
arXiv Detail & Related papers (2021-02-04T11:35:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.