Related papers: Model evaluation for extreme risks

URL: http://arxiv.org/abs/2305.15324v2
Date: Fri, 22 Sep 2023 18:48:42 GMT
Title: Model evaluation for extreme risks
Authors: Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
Abstract summary: Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks.
Score: 46.53170857607407
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

Related papers

Adapting Probabilistic Risk Assessment for AI [0.0]
General-purpose artificial intelligence (AI) systems present an urgent risk management challenge. Current methods often rely on selective testing and undocumented assumptions about risk priorities. This paper introduces the probabilistic risk assessment (PRA) for AI framework.
arXiv Detail & Related papers (2025-04-25T17:59:14Z)
What Makes an Evaluation Useful? Common Pitfalls and Best Practices [3.4740704830599385]
We discuss the steps of the initial thought process, which connects threat modeling to evaluation design. We provide the characteristics and parameters that make an evaluation useful.
arXiv Detail & Related papers (2025-03-30T12:51:47Z)
Fully Autonomous AI Agents Should Not be Developed [58.88624302082713]
This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels. Our analysis reveals that risks to people increase with the autonomy of a system.
arXiv Detail & Related papers (2025-02-04T19:00:06Z)
What AI evaluations for preventing catastrophic risks can and cannot do [2.07180164747172]
We argue that evaluations face fundamental limitations that cannot be overcome within the current paradigm. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe.
arXiv Detail & Related papers (2024-11-26T18:00:36Z)
Sabotage Evaluations for Frontier Models [48.23262570766321]
Sufficiently capable models could subvert human oversight and decision-making in important contexts. We develop a set of related threat models and evaluations. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
arXiv Detail & Related papers (2024-10-28T20:34:51Z)
Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization. We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z)
EAIRiskBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [47.69642609574771]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EAIRiskBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems. Developers of AI systems may have incentives for evaluations to understate the AI's actual capability. In this paper we assess sandbagging capabilities in contemporary language models.
arXiv Detail & Related papers (2024-06-11T15:26:57Z)
Generative AI Models: Opportunities and Risks for Industry and Authorities [1.3914994102950027]
Generative AI models are capable of performing a wide range of tasks that traditionally require creativity and human understanding. They learn patterns from existing data during training and can subsequently generate new content. The use of generative AI models introduces novel IT security risks that need to be considered.
arXiv Detail & Related papers (2024-06-07T08:34:30Z)
Risks and Opportunities of Open-Source Generative AI [64.86989162783648]
Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation. This regulation is likely to put at risk the budding field of open-source generative AI.
arXiv Detail & Related papers (2024-05-14T13:37:36Z)
Evaluating Frontier Models for Dangerous Capabilities [59.129424649740855]
We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
arXiv Detail & Related papers (2024-03-20T17:54:26Z)
Asset-centric Threat Modeling for AI-based Systems [7.696807063718328]
This paper presents ThreatFinderAI, an approach and tool to model AI-related assets, threats, and countermeasures, and to quantify residual risks. To evaluate the practicality of the approach, participants were tasked with recreating a threat model, developed by cybersecurity experts, of an AI-based healthcare platform. Overall, the solution's usability was well perceived, and it effectively supports threat identification and risk discussion.
arXiv Detail & Related papers (2024-03-11T08:40:01Z)
Sociotechnical Safety Evaluation of Generative AI Systems [13.546708226350963]
Generative AI systems produce a range of risks. To ensure the safety of generative AI systems, these risks must be evaluated. We propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks.
arXiv Detail & Related papers (2023-10-18T14:13:58Z)
Quantitative AI Risk Assessments: Opportunities and Challenges [9.262092738841979]
AI-based systems are increasingly being leveraged to provide value to organizations, individuals, and society. The risks of these systems have led to proposed regulations, litigation, and general societal concerns. This paper explores the concept of a quantitative AI Risk Assessment.
arXiv Detail & Related papers (2022-09-13T21:47:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.