Model evaluation for extreme risks
- URL: http://arxiv.org/abs/2305.15324v2
- Date: Fri, 22 Sep 2023 18:48:42 GMT
- Title: Model evaluation for extreme risks
- Authors: Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess
Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus
Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins,
Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul
Christiano, Allan Dafoe
- Abstract summary: Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
- Score: 46.53170857607407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current approaches to building general-purpose AI systems tend to produce
systems with both beneficial and harmful capabilities. Further progress in AI
development could lead to capabilities that pose extreme risks, such as
offensive cyber capabilities or strong manipulation skills. We explain why
model evaluation is critical for addressing extreme risks. Developers must be
able to identify dangerous capabilities (through "dangerous capability
evaluations") and the propensity of models to apply their capabilities for harm
(through "alignment evaluations"). These evaluations will become critical for
keeping policymakers and other stakeholders informed, and for making
responsible decisions about model training, deployment, and security.
Related papers
- Fully Autonomous AI Agents Should Not be Developed [58.88624302082713]
This paper argues that fully autonomous AI agents should not be developed.
In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels.
Our analysis reveals that risks to people increase with the autonomy of a system.
arXiv Detail & Related papers (2025-02-04T19:00:06Z) - What AI evaluations for preventing catastrophic risks can and cannot do [2.07180164747172]
We argue that evaluations face fundamental limitations that cannot be overcome within the current paradigm.
This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe.
arXiv Detail & Related papers (2024-11-26T18:00:36Z) - Sabotage Evaluations for Frontier Models [48.23262570766321]
Sufficiently capable models could subvert human oversight and decision-making in important contexts.
We develop a set of related threat models and evaluations.
We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
arXiv Detail & Related papers (2024-10-28T20:34:51Z) - Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.
Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.
However, the deployment of these agents in physical environments presents significant safety challenges.
This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems.
Developers of AI systems may have incentives for evaluations to understate the AI's actual capability.
In this paper we assess sandbagging capabilities in contemporary language models.
arXiv Detail & Related papers (2024-06-11T15:26:57Z) - Generative AI Models: Opportunities and Risks for Industry and Authorities [1.3196892898418466]
Generative AI models are capable of performing a wide variety of tasks that have traditionally required creativity and human understanding.
During training, they learn patterns from existing data and can subsequently generate new content.
Many risks associated with generative AI must be addressed during development or can only be influenced by the operating organisation.
arXiv Detail & Related papers (2024-06-07T08:34:30Z) - Evaluating Frontier Models for Dangerous Capabilities [59.129424649740855]
We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
arXiv Detail & Related papers (2024-03-20T17:54:26Z) - Asset-centric Threat Modeling for AI-based Systems [7.696807063718328]
This paper presents ThreatFinderAI, an approach and tool to model AI-related assets, threats, countermeasures, and quantify residual risks.
To evaluate the practicality of the approach, participants were tasked to recreate a threat model developed by cybersecurity experts of an AI-based healthcare platform.
Overall, the solution's usability was well-perceived and effectively supports threat identification and risk discussion.
arXiv Detail & Related papers (2024-03-11T08:40:01Z) - Quantitative AI Risk Assessments: Opportunities and Challenges [7.35411010153049]
Best way to reduce risks is to implement comprehensive AI lifecycle governance.
Risks can be quantified using metrics from the technical community.
This paper explores these issues, focusing on the opportunities, challenges, and potential impacts of such an approach.
arXiv Detail & Related papers (2022-09-13T21:47:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.