Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
- URL: http://arxiv.org/abs/2409.07985v1
- Date: Thu, 12 Sep 2024 12:30:07 GMT
- Title: Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
- Authors: Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate
- Abstract summary: This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable game.
We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants.
- Score: 52.40622903199512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce methods for finding optimal protocols in AI-Control Games, by reducing them to a set of zero-sum partially observable stochastic games. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. Finally, we demonstrate the utility of our formalism by showcasing improvements over empirical studies in existing settings, evaluating protocols in new settings, and analysing how modelling assumptions affect the safety and usefulness of protocols.
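The Trusted Monitoring setup described in the abstract can be sketched as a simple deployment loop: every task is solved by the untrusted model, a weaker trusted monitor assigns each solution a suspicion score, and the most suspicious solutions are escalated to a limited human audit budget. This is a minimal illustration only; the function names, threshold, and budget below are hypothetical and not taken from the paper.

```python
def trusted_monitoring(tasks, untrusted_solve, monitor_score, human_audit,
                       audit_threshold=0.9, audit_budget=0.02):
    """Sketch of a trusted monitoring protocol: the untrusted model solves
    each task, a weaker trusted monitor scores the solution, and solutions
    above the threshold are audited while the human budget lasts."""
    audits_left = max(1, int(audit_budget * len(tasks)))
    accepted = []
    for task in tasks:
        solution = untrusted_solve(task)
        score = monitor_score(task, solution)
        if score >= audit_threshold and audits_left > 0:
            audits_left -= 1
            if human_audit(task, solution):
                continue  # audit flagged the solution: discard it
        accepted.append(solution)
    return accepted

# Toy usage: the "untrusted model" plants a backdoor on task 7, and the
# weaker monitor happens to give backdoored solutions a high suspicion score.
solve = lambda t: "backdoor" if t == 7 else "clean"
score = lambda t, s: 0.95 if s == "backdoor" else 0.1
audit = lambda t, s: s == "backdoor"
out = trusted_monitoring(range(10), solve, score, audit, audit_budget=0.2)
print(len(out))  # 9: the backdoored solution is caught and dropped
```

The safety/usefulness trade-off studied in the paper corresponds to tuning parameters like the threshold and audit budget against a red team that chooses when to submit backdoors.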
Related papers
- Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol [12.46794226683161]
We develop a new protocol for evaluating model-free and model-based selectors.
We find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
arXiv Detail & Related papers (2025-02-11T23:40:55Z)
- Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols [53.53802315778733]
Previous work evaluated protocols by subverting them with a human-AI red team, where an AI follows the human-written strategy.
This paper investigates how well AI systems can generate and act on strategies for subverting control protocols whilst operating without private memory.
arXiv Detail & Related papers (2024-12-17T02:33:45Z)
- Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality [58.87422943009375]
This paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.
The protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module.
We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code.
arXiv Detail & Related papers (2024-06-13T06:09:22Z)
- Pessimism of the Will, Optimism of the Intellect: Fair Protocols with Malicious but Rational Agents [0.6990493129893112]
We present a game-based framework for the study of fairness protocols.
It is based on the notion of strong secure equilibria, and leverages the conceptual and algorithmic toolbox of game theory.
arXiv Detail & Related papers (2024-05-29T10:15:36Z)
- Inferring State Machine from the Protocol Implementation via Large Language Model [18.942047454890847]
State machines play a pivotal role in augmenting the efficacy of protocol analyzing to unveil more vulnerabilities.
Traditional methods based on dynamic analysis often overlook crucial state transitions due to limited coverage.
We propose an innovative state machine inference approach powered by Large Language Models (LLMs) named ProtocolGPT.
arXiv Detail & Related papers (2024-05-01T08:46:36Z)
- A Survey and Comparative Analysis of Security Properties of CAN Authentication Protocols [92.81385447582882]
The Controller Area Network (CAN) bus leaves in-vehicle communications inherently insecure.
This paper reviews and compares the 15 most prominent authentication protocols for the CAN bus.
We evaluate protocols based on essential operational criteria that contribute to ease of implementation.
arXiv Detail & Related papers (2024-01-19T14:52:04Z)
- Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees [123.0401978870009]
We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates.
We benchmark against competing protocols and show the empirical superiority of the proposed protocols.
Our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server.
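The bucketing step mentioned above can be sketched as follows: client updates are shuffled, averaged within fixed-size buckets, and the bucket means are then passed to a robust aggregator (coordinate-wise median here). This is a minimal sketch under those assumptions; the function name and parameters are hypothetical, not the paper's implementation.

```python
import random
import statistics

def bucketing_aggregate(updates, bucket_size=2, seed=0):
    """Shuffle client updates, average within buckets of `bucket_size`,
    then take a coordinate-wise median over the bucket means."""
    rng = random.Random(seed)
    shuffled = updates[:]
    rng.shuffle(shuffled)
    buckets = [shuffled[i:i + bucket_size]
               for i in range(0, len(shuffled), bucket_size)]
    # Mean of each bucket, coordinate by coordinate.
    means = [[sum(coord) / len(b) for coord in zip(*b)] for b in buckets]
    # Robust aggregation over bucket means: coordinate-wise median.
    return [statistics.median(coord) for coord in zip(*means)]

# Eight honest clients report gradients near [1.0, -1.0]; one Byzantine
# client reports a huge outlier. The outlier lands in only one bucket,
# so the median over bucket means suppresses it.
honest = [[1.0, -1.0]] * 8
byzantine = [[100.0, 100.0]]
agg = bucketing_aggregate(honest + byzantine, bucket_size=3)
print(agg)  # [1.0, -1.0]
```

Bucketing limits how many bucket means a fixed number of Byzantine clients can corrupt, which is what lets the subsequent robust aggregator achieve the claimed statistical rates.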
arXiv Detail & Related papers (2022-05-24T04:03:07Z)
- Towards Controllable Agent in MOBA Games with Generative Modeling [0.45687771576879593]
We propose novel methods to develop an action-controllable agent that behaves like a human.
We devise a deep latent alignment neural network model for training the agent, and a corresponding sampling algorithm for controlling the agent's actions.
Both simulated and online experiments in the game Honor of Kings demonstrate the efficacy of the proposed methods.
arXiv Detail & Related papers (2021-12-15T13:09:22Z)
- Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method.
PCL implicitly encodes semantic structures of the data into the learned embedding space.
PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
arXiv Detail & Related papers (2020-05-11T09:53:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.