Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
- URL: http://arxiv.org/abs/2409.07985v1
- Date: Thu, 12 Sep 2024 12:30:07 GMT
- Title: Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
- Authors: Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate
- Abstract summary: This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable game.
We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants.
- Score: 52.40622903199512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce methods for finding optimal protocols in AI-Control Games, by reducing them to a set of zero-sum partially observable stochastic games. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. Finally, we demonstrate the utility of our formalism by showcasing improvements over empirical studies in existing settings, evaluating protocols in new settings, and analysing how modelling assumptions affect the safety and usefulness of protocols.
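The Trusted Monitoring setup described in the abstract can be sketched as a simple deployment loop: every task is solved by the untrusted model, a weaker trusted monitor assigns each solution a suspicion score, and the most suspicious solutions are escalated to a limited human audit budget. This is a minimal illustration only; the function names, threshold, and budget below are hypothetical and not taken from the paper.

```python
def trusted_monitoring(tasks, untrusted_solve, monitor_score, human_audit,
                       audit_threshold=0.9, audit_budget=0.02):
    """Sketch of a trusted monitoring protocol: the untrusted model solves
    each task, a weaker trusted monitor scores the solution, and solutions
    above the threshold are audited while the human budget lasts."""
    audits_left = max(1, int(audit_budget * len(tasks)))
    accepted = []
    for task in tasks:
        solution = untrusted_solve(task)
        score = monitor_score(task, solution)
        if score >= audit_threshold and audits_left > 0:
            audits_left -= 1
            if human_audit(task, solution):
                continue  # audit flagged the solution: discard it
        accepted.append(solution)
    return accepted

# Toy usage: the "untrusted model" plants a backdoor on task 7, and the
# weaker monitor happens to give backdoored solutions a high suspicion score.
solve = lambda t: "backdoor" if t == 7 else "clean"
score = lambda t, s: 0.95 if s == "backdoor" else 0.1
audit = lambda t, s: s == "backdoor"
out = trusted_monitoring(range(10), solve, score, audit, audit_budget=0.2)
print(len(out))  # 9: the backdoored solution is caught and dropped
```

The safety/usefulness trade-off studied in the paper corresponds to tuning parameters like the threshold and audit budget against a red team that chooses when to submit backdoors.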
Related papers
- Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol [12.46794226683161]
We develop a new protocol for evaluating model-free and model-based selectors.
We find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
arXiv Detail & Related papers (2025-02-11T23:40:55Z)
- Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols [53.53802315778733]
Previous work evaluated protocols by subverting them with a human-AI red team, where an AI follows the human-written strategy.
This paper investigates how well AI systems can generate and act on strategies for subverting control protocols whilst operating without private memory.
arXiv Detail & Related papers (2024-12-17T02:33:45Z)
- Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality [58.87422943009375]
This paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.
The protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module.
We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code.
arXiv Detail & Related papers (2024-06-13T06:09:22Z)
- Pessimism of the Will, Optimism of the Intellect: Fair Protocols with Malicious but Rational Agents [0.6990493129893112]
We present a game-based framework for the study of fairness protocols.
It is based on the notion of strong secure equilibria, and leverages the conceptual and algorithmic toolbox of game theory.
arXiv Detail & Related papers (2024-05-29T10:15:36Z)
- Inferring State Machine from the Protocol Implementation via Large Language Model [18.942047454890847]
State machines play a pivotal role in augmenting the efficacy of protocol analyzing to unveil more vulnerabilities.
Traditional methods based on dynamic analysis often overlook crucial state transitions due to limited coverage.
We propose an innovative state machine inference approach powered by Large Language Models (LLMs) named ProtocolGPT.
arXiv Detail & Related papers (2024-05-01T08:46:36Z)
- A Survey and Comparative Analysis of Security Properties of CAN Authentication Protocols [92.81385447582882]
The Controller Area Network (CAN) bus leaves in-vehicle communications inherently insecure.
This paper reviews and compares the 15 most prominent authentication protocols for the CAN bus.
We evaluate protocols based on essential operational criteria that contribute to ease of implementation.
arXiv Detail & Related papers (2024-01-19T14:52:04Z)
- Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees [123.0401978870009]
We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates.
We benchmark against competing protocols and show the empirical superiority of the proposed protocols.
Our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server.
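The bucketing step mentioned above can be sketched as follows: client updates are shuffled, averaged within fixed-size buckets, and the bucket means are then passed to a robust aggregator (coordinate-wise median here). This is a minimal sketch under those assumptions; the function name and parameters are hypothetical, not the paper's implementation.

```python
import random
import statistics

def bucketing_aggregate(updates, bucket_size=2, seed=0):
    """Shuffle client updates, average within buckets of `bucket_size`,
    then take a coordinate-wise median over the bucket means."""
    rng = random.Random(seed)
    shuffled = updates[:]
    rng.shuffle(shuffled)
    buckets = [shuffled[i:i + bucket_size]
               for i in range(0, len(shuffled), bucket_size)]
    # Mean of each bucket, coordinate by coordinate.
    means = [[sum(coord) / len(b) for coord in zip(*b)] for b in buckets]
    # Robust aggregation over bucket means: coordinate-wise median.
    return [statistics.median(coord) for coord in zip(*means)]

# Eight honest clients report gradients near [1.0, -1.0]; one Byzantine
# client reports a huge outlier. The outlier lands in only one bucket,
# so the median over bucket means suppresses it.
honest = [[1.0, -1.0]] * 8
byzantine = [[100.0, 100.0]]
agg = bucketing_aggregate(honest + byzantine, bucket_size=3)
print(agg)  # [1.0, -1.0]
```

Bucketing limits how many bucket means a fixed number of Byzantine clients can corrupt, which is what lets the subsequent robust aggregator achieve the claimed statistical rates.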
arXiv Detail & Related papers (2022-05-24T04:03:07Z)
- Towards Controllable Agent in MOBA Games with Generative Modeling [0.45687771576879593]
We propose novel methods to develop an action-controllable agent that behaves like a human.
We devise a deep latent alignment neural network model for training the agent, and a corresponding sampling algorithm for controlling the agent's actions.
Both simulated and online experiments in the game Honor of Kings demonstrate the efficacy of the proposed methods.
arXiv Detail & Related papers (2021-12-15T13:09:22Z)
- Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method.
PCL implicitly encodes semantic structures of the data into the learned embedding space.
PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
arXiv Detail & Related papers (2020-05-11T09:53:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.