Evaluating Frontier Models for Dangerous Capabilities
- URL: http://arxiv.org/abs/2403.13793v2
- Date: Fri, 5 Apr 2024 12:26:11 GMT
- Title: Evaluating Frontier Models for Dangerous Capabilities
- Authors: Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane
- Abstract summary: We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
- Score: 59.129424649740855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
Related papers
- OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
- Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations [47.698233647783965]
We present a quantitative model for tracking dangerous AI capabilities over time.
Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks.
arXiv Detail & Related papers (2024-12-19T22:31:34Z)
- What AI evaluations for preventing catastrophic risks can and cannot do [2.07180164747172]
We argue that evaluations face fundamental limitations that cannot be overcome within the current paradigm.
This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe.
arXiv Detail & Related papers (2024-11-26T18:00:36Z)
- Defining and Evaluating Physical Safety for Large Language Models [62.4971588282174]
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones.
The physical threats and harms they could cause in real-world applications remain unexplored.
We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations.
arXiv Detail & Related papers (2024-11-04T17:41:25Z)
- Sabotage Evaluations for Frontier Models [48.23262570766321]
Sufficiently capable models could subvert human oversight and decision-making in important contexts.
We develop a set of related threat models and evaluations.
We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
arXiv Detail & Related papers (2024-10-28T20:34:51Z)
- Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers [0.2913760942403036]
This paper focuses on one possible response: coordinated pausing.
It proposes an evaluation-based coordination scheme that consists of five main steps.
It concludes that coordinated pausing is a promising mechanism for tackling emerging risks from frontier AI models.
arXiv Detail & Related papers (2023-09-30T13:38:33Z)
- Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and accepts no responsibility for any consequences of its use.