Coordinated pausing: An evaluation-based coordination scheme for
frontier AI developers
- URL: http://arxiv.org/abs/2310.00374v1
- Date: Sat, 30 Sep 2023 13:38:33 GMT
- Title: Coordinated pausing: An evaluation-based coordination scheme for
frontier AI developers
- Authors: Jide Alaga and Jonas Schuett
- Abstract summary: This paper focuses on one possible response: coordinated pausing.
It proposes an evaluation-based coordination scheme that consists of five main steps.
It concludes that coordinated pausing is a promising mechanism for tackling emerging risks from frontier AI models.
- Score: 0.2913760942403036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As artificial intelligence (AI) models are scaled up, new capabilities can
emerge unintentionally and unpredictably, some of which might be dangerous. In
response, dangerous capabilities evaluations have emerged as a new risk
assessment tool. But what should frontier AI developers do if sufficiently
dangerous capabilities are in fact discovered? This paper focuses on one
possible response: coordinated pausing. It proposes an evaluation-based
coordination scheme that consists of five main steps: (1) Frontier AI models
are evaluated for dangerous capabilities. (2) Whenever, and each time, a model
fails a set of evaluations, the developer pauses certain research and
development activities. (3) Other developers are notified whenever a model with
dangerous capabilities has been discovered. They also pause related research
and development activities. (4) The discovered capabilities are analyzed and
adequate safety precautions are put in place. (5) Developers only resume their
paused activities if certain safety thresholds are reached. The paper also
discusses four concrete versions of that scheme. In the first version, pausing
is completely voluntary and relies on public pressure on developers. In the
second version, participating developers collectively agree to pause under
certain conditions. In the third version, a single auditor evaluates models of
multiple developers who agree to pause if any model fails a set of evaluations.
In the fourth version, developers are legally required to run evaluations and
pause if dangerous capabilities are discovered. Finally, the paper discusses
the desirability and feasibility of the proposed coordination scheme. It
concludes that coordinated pausing is a promising mechanism for tackling
emerging risks from frontier AI models. However, a number of practical and
legal obstacles need to be overcome, especially how to avoid violations of
antitrust law.
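As a rough illustration of the five-step scheme summarized above, the following Python sketch models a single evaluation round. All names here (Developer, run_round, fails_evaluations, safety_threshold_reached) are hypothetical placeholders for this sketch, not an implementation from the paper.

# Minimal sketch (not from the paper): one round of the coordinated pausing scheme.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Developer:
    name: str
    paused: bool = False

def run_round(
    developers: List[Developer],
    fails_evaluations: Callable[[Developer], bool],
    safety_threshold_reached: Callable[[], bool],
) -> None:
    # Step 1: each developer's frontier model is evaluated for dangerous capabilities.
    any_failure = any(fails_evaluations(dev) for dev in developers)
    if any_failure:
        # Steps 2-3: the affected developer pauses, and peers are notified and pause
        # related research and development activities as well.
        for dev in developers:
            dev.paused = True
        # Step 4: the discovered capabilities are analyzed and safety precautions
        # are put in place (treated here as an external process).
    # Step 5: paused activities resume only once agreed safety thresholds are reached.
    if safety_threshold_reached():
        for dev in developers:
            dev.paused = False

# Toy usage: one of two developers fails its evaluations, so both pause.
devs = [Developer("A"), Developer("B")]
run_round(devs, fails_evaluations=lambda d: d.name == "A",
          safety_threshold_reached=lambda: False)
print([(d.name, d.paused) for d in devs])  # [('A', True), ('B', True)]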
Related papers
- Sabotage Evaluations for Frontier Models [48.23262570766321]
Sufficiently capable models could subvert human oversight and decision-making in important contexts.
We develop a set of related threat models and evaluations.
We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
arXiv Detail & Related papers (2024-10-28T20:34:51Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions (an illustrative formalization appears after this list).
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems.
Developers of AI systems may have incentives for sandbagging evaluations.
We show that capability evaluations are vulnerable to sandbagging.
arXiv Detail & Related papers (2024-06-11T15:26:57Z)
- Evaluating Frontier Models for Dangerous Capabilities [59.129424649740855]
We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
arXiv Detail & Related papers (2024-03-20T17:54:26Z)
- A Safe Harbor for AI Evaluation and Red Teaming [124.89885800509505]
Some researchers fear that conducting independent evaluation and red teaming research, or releasing their findings, will result in account suspensions or legal reprisal.
We propose that major AI developers commit to providing a legal and technical safe harbor.
We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.
arXiv Detail & Related papers (2024-03-07T20:55:08Z)
- Deployment Corrections: An incident response framework for frontier AI models [0.0]
This paper explores contingency plans for cases where pre-deployment risk management falls short.
We describe a toolkit of deployment corrections that AI developers can use to respond to dangerous capabilities.
We recommend that frontier AI developers, standard-setting organizations, and regulators collaborate to define a standardized industry-wide approach.
arXiv Detail & Related papers (2023-09-30T10:07:39Z)
- Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
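The true criticality introduced in the Criticality and Safety Margins entry above can be written, roughly, as an expected return gap; the LaTeX below is an illustrative reconstruction under assumed notation, not the cited paper's own formula.

% Illustrative notation only: G^{\pi}(s) is the return from following policy \pi
% in state s; the second term takes n consecutive uniformly random actions from s
% before reverting to \pi.
C_n(s) = \mathbb{E}\big[ G^{\pi}(s) \big] - \mathbb{E}\big[ G^{\mathrm{rand}(n) \to \pi}(s) \big]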