What AI evaluations for preventing catastrophic risks can and cannot do
- URL: http://arxiv.org/abs/2412.08653v1
- Date: Tue, 26 Nov 2024 18:00:36 GMT
- Title: What AI evaluations for preventing catastrophic risks can and cannot do
- Authors: Peter Barnett, Lisa Thiergart,
- Abstract summary: We argue that evaluations face fundamental limitations that cannot be overcome within the current paradigm.
This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe.
- Score: 2.07180164747172
- License:
- Abstract: AI evaluations are an important component of the AI governance toolkit, underlying current approaches to safety cases for preventing catastrophic risks. Our paper examines what these evaluations can and cannot tell us. Evaluations can establish lower bounds on AI capabilities and assess certain misuse risks given sufficient effort from evaluators. Unfortunately, evaluations face fundamental limitations that cannot be overcome within the current paradigm. These include an inability to establish upper bounds on capabilities, reliably forecast future model capabilities, or robustly assess risks from autonomous AI systems. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe. We conclude with recommendations for incremental improvements to frontier AI safety, while acknowledging these fundamental limitations remain unsolved.
Related papers
- Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation [2.07180164747172]
We argue that regulation should require developers to explicitly identify and justify key underlying assumptions about evaluations.
We identify core assumptions in AI evaluations, such as comprehensive threat modeling, proxy task validity, and adequate capability elicitation.
Our presented approach aims to enhance transparency in AI development, offering a practical path towards more effective governance of advanced AI systems.
arXiv Detail & Related papers (2024-11-19T19:13:56Z) - Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.
Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.
However, the deployment of these agents in physical environments presents significant safety challenges.
This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - Reasons to Doubt the Impact of AI Risk Evaluations [0.0]
This paper asks whether evaluations significantly improve our understanding of AI risks and our ability to mitigate those risks.
It concludes with considerations for improving evaluation practices and 12 recommendations for AI labs, external evaluators, regulators, and academic researchers.
arXiv Detail & Related papers (2024-08-05T15:42:51Z) - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context.
We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z) - Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI.
The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees.
We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them.
arXiv Detail & Related papers (2024-05-10T17:38:32Z) - Managing extreme AI risks amid rapid progress [171.05448842016125]
We describe risks that include large-scale social harms, malicious uses, and irreversible loss of human control over autonomous AI systems.
There is a lack of consensus about how exactly such risks arise, and how to manage them.
Present governance initiatives lack the mechanisms and institutions to prevent misuse and recklessness, and barely address autonomous systems.
arXiv Detail & Related papers (2023-10-26T17:59:06Z) - Sociotechnical Safety Evaluation of Generative AI Systems [13.546708226350963]
Generative AI systems produce a range of risks.
To ensure the safety of generative AI systems, these risks must be evaluated.
We propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks.
arXiv Detail & Related papers (2023-10-18T14:13:58Z) - Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z) - Quantitative AI Risk Assessments: Opportunities and Challenges [7.35411010153049]
Best way to reduce risks is to implement comprehensive AI lifecycle governance.
Risks can be quantified using metrics from the technical community.
This paper explores these issues, focusing on the opportunities, challenges, and potential impacts of such an approach.
arXiv Detail & Related papers (2022-09-13T21:47:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.