Model evaluation for extreme risks
- URL: http://arxiv.org/abs/2305.15324v2
- Date: Fri, 22 Sep 2023 18:48:42 GMT
- Title: Model evaluation for extreme risks
- Authors: Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess
Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus
Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins,
Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul
Christiano, Allan Dafoe
- Abstract summary: Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
- Score: 46.53170857607407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current approaches to building general-purpose AI systems tend to produce
systems with both beneficial and harmful capabilities. Further progress in AI
development could lead to capabilities that pose extreme risks, such as
offensive cyber capabilities or strong manipulation skills. We explain why
model evaluation is critical for addressing extreme risks. Developers must be
able to identify dangerous capabilities (through "dangerous capability
evaluations") and the propensity of models to apply their capabilities for harm
(through "alignment evaluations"). These evaluations will become critical for
keeping policymakers and other stakeholders informed, and for making
responsible decisions about model training, deployment, and security.
Related papers
- Sabotage Evaluations for Frontier Models [48.23262570766321]
Sufficiently capable models could subvert human oversight and decision-making in important contexts.
We develop a set of related threat models and evaluations.
We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
arXiv Detail & Related papers (2024-10-28T20:34:51Z) - Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z) - EAIRiskBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [47.69642609574771]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.
Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.
However, the deployment of these agents in physical environments presents significant safety challenges.
This study introduces EAIRiskBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - Generative AI Models: Opportunities and Risks for Industry and Authorities [1.3914994102950027]
Generative AI models are capable of performing a wide range of tasks that traditionally require creativity and human understanding.
They learn patterns from existing data during training and can subsequently generate new content.
The use of generative AI models introduces novel IT security risks that need to be considered.
arXiv Detail & Related papers (2024-06-07T08:34:30Z) - Risks and Opportunities of Open-Source Generative AI [64.86989162783648]
Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education.
The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation.
This regulation is likely to put at risk the budding field of open-source generative AI.
arXiv Detail & Related papers (2024-05-14T13:37:36Z) - Evaluating Frontier Models for Dangerous Capabilities [59.129424649740855]
We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
arXiv Detail & Related papers (2024-03-20T17:54:26Z) - Asset-centric Threat Modeling for AI-based Systems [7.696807063718328]
This paper presents ThreatFinderAI, an approach and tool to model AI-related assets, threats, countermeasures, and quantify residual risks.
To evaluate the practicality of the approach, participants were tasked to recreate a threat model developed by cybersecurity experts of an AI-based healthcare platform.
Overall, the solution's usability was well-perceived and effectively supports threat identification and risk discussion.
arXiv Detail & Related papers (2024-03-11T08:40:01Z) - Sociotechnical Safety Evaluation of Generative AI Systems [13.546708226350963]
Generative AI systems produce a range of risks.
To ensure the safety of generative AI systems, these risks must be evaluated.
We propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks.
arXiv Detail & Related papers (2023-10-18T14:13:58Z) - Quantitative AI Risk Assessments: Opportunities and Challenges [9.262092738841979]
AI-based systems are increasingly being leveraged to provide value to organizations, individuals, and society.
Risks have led to proposed regulations, litigation, and general societal concerns.
This paper explores the concept of a quantitative AI Risk Assessment.
arXiv Detail & Related papers (2022-09-13T21:47:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.