Sabotage Evaluations for Frontier Models
- URL: http://arxiv.org/abs/2410.21514v1
- Date: Mon, 28 Oct 2024 20:34:51 GMT
- Title: Sabotage Evaluations for Frontier Models
- Authors: Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, David Duvenaud
- Abstract summary: Sufficiently capable models could subvert human oversight and decision-making in important contexts.
We develop a set of related threat models and evaluations.
We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
- Score: 48.23262570766321
- License:
- Abstract: Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.
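The closing point about simulating large-scale deployments using small-scale statistics can be illustrated with a rough back-of-the-envelope extrapolation. The sketch below is not the authors' method; it simply assumes a per-episode sabotage success rate estimated from a small number of evaluation episodes (with a rule-of-three upper bound when no successes are observed) and treats deployment episodes as independent.

```python
def deployment_risk(successes: int, trials: int, deployment_episodes: int) -> float:
    """Extrapolate a per-episode sabotage rate, measured in small-scale
    evaluations, to the probability of at least one success at deployment scale.

    Hypothetical estimator: when zero successes are observed, upper-bound the
    per-episode rate with the rule of three (~95% confidence); otherwise use
    the empirical rate. Episodes are assumed to be independent.
    """
    p_episode = (3.0 / trials) if successes == 0 else (successes / trials)
    return 1.0 - (1.0 - p_episode) ** deployment_episodes

# Example: 0 observed successes in 200 evaluation episodes,
# extrapolated to 10,000 deployment episodes.
print(f"{deployment_risk(0, 200, 10_000):.3f}")
```

Under these assumptions, a per-episode rate too small to observe directly in a few hundred trials can still imply a high probability of at least one success across a large deployment, which is the kind of scaling question such simulations are meant to address.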
Related papers
- Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system.
We propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights.
arXiv Detail & Related papers (2025-02-03T18:59:16Z)
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate using a strong pretrained model to improve weak human supervision, and then supervising the strong model with this enhanced weak supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- Frontier Models are Capable of In-context Scheming [41.30527987937867]
One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives.
We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals.
We find that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities.
arXiv Detail & Related papers (2024-12-06T12:09:50Z)
- AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems.
Developers of AI systems may have incentives for evaluations to understate the AI's actual capability.
In this paper we assess sandbagging capabilities in contemporary language models.
arXiv Detail & Related papers (2024-06-11T15:26:57Z)
- Evaluating Frontier Models for Dangerous Capabilities [59.129424649740855]
We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
arXiv Detail & Related papers (2024-03-20T17:54:26Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills.
We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z)
- Predictability and Surprise in Large Generative Models [8.055204456718576]
Large-scale pre-training has emerged as a technique for creating capable, general purpose, generative models.
In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property.
arXiv Detail & Related papers (2022-02-15T23:21:23Z)
- Thief, Beware of What Get You There: Towards Understanding Model Extraction Attack [13.28881502612207]
In some scenarios, AI models are trained proprietarily, where neither pre-trained models nor sufficient in-distribution data is publicly available.
We find that the effectiveness of existing techniques is significantly affected by the absence of pre-trained models.
We formulate model extraction attacks into an adaptive framework that captures these factors with deep reinforcement learning.
arXiv Detail & Related papers (2021-04-13T03:46:59Z)
- A Comprehensive Evaluation Framework for Deep Model Robustness [44.20580847861682]
Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications.
However, they are vulnerable to adversarial examples, which motivates adversarial defense.
This paper presents a model evaluation framework containing a comprehensive, rigorous, and coherent set of evaluation metrics.
arXiv Detail & Related papers (2021-01-24T01:04:25Z)