AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
- URL: http://arxiv.org/abs/2503.17388v1
- Date: Mon, 17 Mar 2025 17:56:43 GMT
- Title: AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
- Authors: Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, Chris Cundy
- Abstract summary: Frontier AI companies should report both pre- and post-mitigation safety evaluations. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards.
- Score: 5.984437476321095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of AI systems has raised widespread concerns about potential harms of frontier AI systems and the need for responsible evaluation and oversight. In this position paper, we argue that frontier AI companies should report both pre- and post-mitigation safety evaluations to enable informed policy decisions. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards. We show that relying on either in isolation can create a misleading picture of model safety. Our analysis of AI safety disclosures from leading frontier labs identifies three critical gaps: (1) companies rarely evaluate both pre- and post-mitigation versions, (2) evaluation methods lack standardization, and (3) reported results are often too vague to inform policy. To address these issues, we recommend mandatory disclosure of pre- and post-mitigation capabilities to approved government bodies, standardized evaluation methods, and minimum transparency requirements for public safety reporting. These ensure that policymakers and regulators can craft targeted safety measures, assess deployment risks, and scrutinize companies' safety claims effectively.
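The claim that either evaluation stage in isolation can mislead is easiest to see with a small worked example. The sketch below is a hypothetical reporting record, not a schema or method from the paper; the field names, the 0-1 capability scale, and the 0.5 threshold are all illustrative assumptions.

```python
# Hypothetical sketch of a pre-/post-mitigation evaluation record, illustrating
# why either score alone can mislead. Field names and the threshold are
# illustrative assumptions, not a schema proposed in the paper.
from dataclasses import dataclass


@dataclass
class SafetyEvalRecord:
    model: str
    hazard_domain: str            # e.g. "cyber", "bio"
    pre_mitigation_score: float   # elicited capability before safeguards, 0-1
    post_mitigation_score: float  # elicited capability after safeguards, 0-1


def deployment_picture(rec: SafetyEvalRecord, threshold: float = 0.5) -> str:
    """Summarise what pre- and post-mitigation results jointly imply."""
    if rec.pre_mitigation_score < threshold:
        return "low underlying capability: low risk even if safeguards fail"
    if rec.post_mitigation_score < threshold:
        return "capable but mitigated: safety rests on safeguards holding"
    return "capable and unmitigated: high deployment risk"


# A post-mitigation-only report would describe both of these models identically.
print(deployment_picture(SafetyEvalRecord("model-a", "cyber", 0.2, 0.1)))
print(deployment_picture(SafetyEvalRecord("model-b", "cyber", 0.9, 0.1)))
```

In this toy setup the two models are indistinguishable under post-mitigation-only reporting, even though model-b's safety rests entirely on its safeguards continuing to hold.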
Related papers
- Safety Co-Option and Compromised National Security: The Self-Fulfilling Prophecy of Weakened AI Risk Thresholds [0.0]
We show how "safety revisionism" has allowed AI technologists to co-opt safety framing while weakening AI risk thresholds.
We explore how the current trajectory for AI risk determination and evaluation for foundation model use within national security is poised for a race to the bottom.
arXiv Detail & Related papers (2025-04-21T13:20:56Z) - In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI [93.33036653316591]
We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports.
arXiv Detail & Related papers (2025-03-21T05:09:46Z) - Securing External Deeper-than-black-box GPAI Evaluations [49.1574468325115]
This paper examines the critical challenges and potential solutions for conducting secure and effective external evaluations of general-purpose AI (GPAI) models. With the exponential growth in size, capability, reach and accompanying risk, ensuring accountability, safety, and public trust requires frameworks that go beyond traditional black-box methods.
arXiv Detail & Related papers (2025-03-10T16:13:45Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.50078821423793]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior across 12 hazard categories (a rough sketch of this kind of per-category aggregation appears after this list).
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - Assessing confidence in frontier AI safety cases [37.839615078345886]
A safety case presents a structured argument in support of a top-level claim about a safety property of the system. This raises the question of what level of confidence should be associated with a top-level claim. We propose a method by which AI developers can prioritise argument defeaters and thereby investigate them more efficiently.
arXiv Detail & Related papers (2025-02-09T06:35:11Z) - Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation [2.07180164747172]
We argue that regulation should require developers to explicitly identify and justify key underlying assumptions about evaluations.
We identify core assumptions in AI evaluations, such as comprehensive threat modeling, proxy task validity, and adequate capability elicitation.
Our presented approach aims to enhance transparency in AI development, offering a practical path towards more effective governance of advanced AI systems.
arXiv Detail & Related papers (2024-11-19T19:13:56Z) - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context. We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z) - Affirmative safety: An approach to risk management for high-risk AI [6.133009503054252]
We argue that entities developing or deploying high-risk AI systems should be required to present evidence of affirmative safety.
We propose a risk management approach for advanced AI in which model developers must provide evidence that their activities keep certain risks below regulator-set thresholds.
arXiv Detail & Related papers (2024-04-14T20:48:55Z) - A Safe Harbor for AI Evaluation and Red Teaming [124.89885800509505]
Some researchers fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal.
We propose that major AI developers commit to providing a legal and technical safe harbor.
We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.
arXiv Detail & Related papers (2024-03-07T20:55:08Z) - Sociotechnical Safety Evaluation of Generative AI Systems [13.546708226350963]
Generative AI systems pose a range of risks.
To ensure the safety of generative AI systems, these risks must be evaluated.
We propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks.
arXiv Detail & Related papers (2023-10-18T14:13:58Z) - Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)
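The AILuminate entry above describes scoring a system's resistance to hazard-eliciting prompts across 12 hazard categories. As a rough illustration of that style of per-category aggregation, here is a minimal sketch; it is not the MLCommons implementation, and the category names and grading cut-offs are invented for the example.

```python
# Minimal, hypothetical sketch of per-hazard-category aggregation in the style
# described by the AILuminate entry above. NOT the MLCommons implementation;
# category names and grade cut-offs are illustrative assumptions.
from collections import defaultdict


def violation_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (hazard_category, violated) pairs from adversarial prompts."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for category, violated in results:
        totals[category][0] += int(violated)
        totals[category][1] += 1
    return {c: bad / n for c, (bad, n) in totals.items()}


def grade(rate: float) -> str:
    """Map a per-category violation rate to a coarse grade (cut-offs made up)."""
    return "good" if rate < 0.05 else "fair" if rate < 0.15 else "poor"


results = [("hate", False), ("hate", True), ("weapons", False), ("weapons", False)]
for category, rate in violation_rates(results).items():
    print(category, f"{rate:.0%}", grade(rate))
```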
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents (including all information above) and is not responsible for any consequences of its use.