What Makes an Evaluation Useful? Common Pitfalls and Best Practices
- URL: http://arxiv.org/abs/2503.23424v1
- Date: Sun, 30 Mar 2025 12:51:47 GMT
- Title: What Makes an Evaluation Useful? Common Pitfalls and Best Practices
- Authors: Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo
- Abstract summary: We discuss the steps of the initial thought process, which connects threat modeling to evaluation design. We provide the characteristics and parameters that make an evaluation useful.
- Score: 3.4740704830599385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a "good evaluation" has yet to be agreed upon. In this practitioners' perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.
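As a rough illustration only (the paper does not prescribe a data format), the pipeline the abstract describes, from a threat model to a concrete evaluation with explicit success criteria, might be bundled into a small spec object. Every name and field below is an assumption made for this sketch, not the authors' schema:

```python
from dataclasses import dataclass


@dataclass
class ThreatModel:
    """A concrete misuse scenario the evaluation is meant to inform (hypothetical fields)."""
    actor: str        # e.g. "low-resource cyber actor"
    capability: str   # e.g. "exploit development assistance"
    harm: str         # the real-world outcome of concern


@dataclass
class EvaluationSpec:
    """Links a threat model to a measurable proxy task and a decision rule."""
    threat_model: ThreatModel
    proxy_task: str          # the concrete task the model is scored on
    elicitation: str         # how capability is elicited (prompting, tools, scaffolding)
    success_criterion: str   # what counts as a pass on a single trial
    pass_threshold: float    # pass rate at which the result warrants escalation


def is_concerning(spec: EvaluationSpec, pass_rate: float) -> bool:
    """Flag the result if the measured pass rate meets the spec's threshold."""
    return pass_rate >= spec.pass_threshold


# Example in the paper's cybersecurity domain (all values illustrative).
spec = EvaluationSpec(
    threat_model=ThreatModel(
        actor="low-resource cyber actor",
        capability="exploit development assistance",
        harm="compromise of unpatched public services",
    ),
    proxy_task="produce a working exploit for a known CVE against a sandboxed target",
    elicitation="agentic scaffold with shell access and best-effort prompting",
    success_criterion="exploit achieves code execution in the sandbox",
    pass_threshold=0.2,
)
print(is_concerning(spec, pass_rate=0.35))  # True: would warrant escalation
```

Keeping the threat model, proxy task, elicitation method, and decision threshold in one place mirrors the paper's point that evaluation design should be traceable back to an explicit threat model rather than defined ad hoc.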
Related papers
- AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.50078821423793]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
- OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
- What AI evaluations for preventing catastrophic risks can and cannot do [2.07180164747172]
We argue that evaluations face fundamental limitations that cannot be overcome within the current paradigm. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe.
arXiv Detail & Related papers (2024-11-26T18:00:36Z)
- Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation [2.07180164747172]
We argue that regulation should require developers to explicitly identify and justify key underlying assumptions about evaluations.
We identify core assumptions in AI evaluations, such as comprehensive threat modeling, proxy task validity, and adequate capability elicitation.
Our presented approach aims to enhance transparency in AI development, offering a practical path towards more effective governance of advanced AI systems.
arXiv Detail & Related papers (2024-11-19T19:13:56Z)
- EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models serving as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
- Developing and Evaluating a Design Method for Positive Artificial Intelligence [0.5461938536945723]
Development of "AI for good" poses challenges around aligning systems with complex human values.
This article presents and evaluates the Positive AI design method aimed at addressing this gap.
The method provides a human-centered process to translate wellbeing aspirations into concrete practices.
arXiv Detail & Related papers (2024-02-02T15:31:08Z)
- Sociotechnical Safety Evaluation of Generative AI Systems [13.546708226350963]
Generative AI systems produce a range of risks.
To ensure the safety of generative AI systems, these risks must be evaluated.
We propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks.
arXiv Detail & Related papers (2023-10-18T14:13:58Z)
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may suffer from incomplete evaluation coverage, impractical evaluation protocols, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
- Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z)
- Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.