How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
- URL: http://arxiv.org/abs/2504.05259v1
- Date: Mon, 07 Apr 2025 16:52:52 GMT
- Title: How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
- Authors: Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
- Abstract summary: We propose a framework for adapting affordances of red teams to advancing AI capabilities. We show how knowledge of an agent's actual capability profile can inform proportional control evaluations.
- Score: 4.027240141373031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.
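To make the framework's core move concrete, the sketch below pairs fictional model tiers with red-team affordances and control measures in a small lookup table. The tiers echo the paper's M1-M5 naming, but every affordance and measure listed here is a placeholder chosen for illustration, not the paper's actual ACL definitions.

```python
from dataclasses import dataclass

@dataclass
class ControlLevel:
    """One AI control level (ACL): the evaluation rules scale with capability."""
    model: str                  # fictional model tier (M1-M5 in the paper)
    red_team_affordances: list  # what the red team may use when building attack policies
    control_measures: list      # example mitigations evaluated at this level

# Placeholder table; the concrete affordances and measures are illustrative only.
ACL_TABLE = [
    ControlLevel("M1", ["human-written attack strategies"], ["trusted monitoring"]),
    ControlLevel("M2", ["attack policies elicited via fine-tuning"], ["resampling suspicious actions"]),
    ControlLevel("M3", ["strategies beyond the best known to humans"], ["restricted affordances", "auditing"]),
]

def required_affordances(model: str) -> list:
    """Look up which red-team affordances a control evaluation should grant."""
    for level in ACL_TABLE:
        if level.model == model:
            return level.red_team_affordances
    raise ValueError(f"No ACL defined for {model}")

print(required_affordances("M2"))
```

The point of the structure is that the red team's affordances are a function of the deployed model's capability profile, so the evaluation tightens as capabilities advance rather than assuming worst-case attacks from the start.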
Related papers
- Ctrl-Z: Controlling AI Agents via Resampling [2.505902477669827]
We construct BashBench, a dataset of 257 challenging multi-step system administration tasks.
We evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code.
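A minimal sketch of the resampling idea described above, assuming a monitor that scores proposed shell commands and a fixed retry budget; the threshold, monitor heuristic, and fallback here are illustrative assumptions, not the protocol from the Ctrl-Z paper.

```python
import random

SUSPICION_THRESHOLD = 0.7   # assumed threshold, not taken from the paper
MAX_RESAMPLES = 3           # assumed retry budget

def monitor_suspicion(command: str) -> float:
    """Stand-in monitor: score how suspicious a proposed command looks (0-1)."""
    risky_tokens = ("curl", "wget", "chmod +x", "base64 -d")
    return 0.9 if any(tok in command for tok in risky_tokens) else random.uniform(0.0, 0.3)

def controlled_execute(agent_propose, execute):
    """Resample the agent's action when the monitor flags it; defer to audit if retries run out."""
    for _ in range(MAX_RESAMPLES):
        command = agent_propose()
        if monitor_suspicion(command) < SUSPICION_THRESHOLD:
            return execute(command)
    return "escalated to human audit"  # never run a command that keeps getting flagged

# Example: a benign proposal passes through unchanged.
print(controlled_execute(lambda: "ls -la", lambda cmd: f"ran: {cmd}"))
```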
arXiv Detail & Related papers (2025-04-14T16:22:11Z)
- Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents [61.132523071109354]
This paper investigates the interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios.
Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" stances than pure game-theoretic agents.
arXiv Detail & Related papers (2025-04-11T15:41:21Z)
- An Approach to Technical AGI Safety and Security [72.83728459135101]
We develop an approach to addressing the risk of harms severe enough to significantly damage humanity.
We focus on technical approaches to misuse and misalignment.
We briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
arXiv Detail & Related papers (2025-04-02T15:59:31Z)
- AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents [75.85554113398626]
We develop a benchmark called AgentDAM to evaluate how well existing and future AI agents can limit processing of potentially private information.
Our benchmark simulates realistic web interaction scenarios and is adaptable to all existing web navigation agents.
arXiv Detail & Related papers (2025-03-12T19:30:31Z)
- Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System [0.8136541584281987]
This work uses three examination methods to detect rogue agents through a Reverse Turing Test and to analyze deceptive alignment through multi-agent simulations.
We develop an anti-jailbreaking system and test it with the GEMINI 1.5 pro, llama-3.3-70B, and deepseek r1 models.
Detection is strong (94% accuracy for GEMINI 1.5 pro), yet the system retains persistent vulnerabilities under prolonged attacks.
arXiv Detail & Related papers (2025-02-23T23:35:15Z)
- A sketch of an AI control safety case [3.753791609999324]
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe.
We sketch how developers could construct a "control safety case".
This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
arXiv Detail & Related papers (2025-01-28T21:52:15Z)
- Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? [50.62012690460685]
This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols.
An AI system may need to reliably generate optimal plans in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.
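One way to picture the "well-calibrated probabilities without communicating" requirement: each instance can hash shared task context into a deterministic pseudo-random draw, so independent copies agree on when to act while still hitting a target rate across contexts. This is a hypothetical illustration of the capability being evaluated, not code from the paper.

```python
import hashlib

def stateless_decision(context: str, target_rate: float) -> bool:
    """Decide whether to act in this context, hitting `target_rate` on average
    across contexts, with every copy of the model agreeing on the same contexts."""
    digest = hashlib.sha256(context.encode()).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64  # deterministic pseudo-uniform in [0, 1)
    return draw < target_rate

# Two independent instances reach the same decision for the same context, with no shared state.
assert stateless_decision("task-417", 0.02) == stateless_decision("task-417", 0.02)
```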
arXiv Detail & Related papers (2024-12-17T02:33:45Z)
- Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats [22.843390303635655]
We investigate whether safety measures remain effective even if large language models intentionally try to bypass them.
We use a two-level deployment framework in which an adaptive macro-protocol chooses between micro-protocols.
At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
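A rough sketch of what such a two-level scheme could look like: a macro-protocol maintains a credence that the untrusted model is aligned and selects a per-task micro-protocol accordingly. The protocol names, thresholds, and update rule are assumptions for illustration rather than the paper's procedure.

```python
def choose_micro_protocol(credence_aligned: float) -> str:
    """Macro-protocol: pick a per-task micro-protocol based on the current credence."""
    if credence_aligned > 0.9:
        return "query_untrusted"              # cheap: use the untrusted model directly
    if credence_aligned > 0.5:
        return "untrusted_plan_trusted_edit"  # intermediate: a trusted model edits the plan
    return "trusted_only"                     # strict: fall back to the trusted model

def update_credence(credence: float, task_looked_suspicious: bool) -> float:
    """Toy update: shift credence down sharply on suspicious behaviour, drift up slowly otherwise."""
    return max(0.0, credence - 0.2) if task_looked_suspicious else min(1.0, credence + 0.01)

# Example: two suspicious tasks push the deployment toward stricter micro-protocols.
credence = 0.95
for suspicious in [False, False, True, True]:
    credence = update_credence(credence, suspicious)
    print(choose_micro_protocol(credence))
```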
arXiv Detail & Related papers (2024-11-26T18:58:20Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
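The definition of true criticality translates directly into a Monte Carlo estimate: compare the expected return from following the policy at a state with the expected return after substituting n consecutive random actions there. The sketch below assumes a simplified environment interface (env.step(state, action) returning (state, reward, done) and env.sample_action()); it is illustrative, not the paper's implementation.

```python
def rollout(env, policy, state, random_steps=0, horizon=100):
    """Cumulative reward from `state`, taking `random_steps` random actions first
    and following `policy` afterwards (simplified environment interface assumed)."""
    total = 0.0
    for t in range(horizon):
        action = env.sample_action() if t < random_steps else policy(state)
        state, reward, done = env.step(state, action)
        total += reward
        if done:
            break
    return total

def true_criticality(env, policy, state, n, num_samples=100):
    """Expected drop in return from deviating to n consecutive random actions at `state`."""
    on_policy = sum(rollout(env, policy, state) for _ in range(num_samples)) / num_samples
    deviated = sum(rollout(env, policy, state, random_steps=n) for _ in range(num_samples)) / num_samples
    return on_policy - deviated
```

Proxy criticality, as described above, would be any cheaper metric that rises and falls monotonically with this quantity, so it can stand in for the expensive rollout-based estimate at decision time.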
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification [35.16099878559559]
Large language models (LLMs) have experienced significant development and are being deployed in real-world applications.
We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions.
Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios.
arXiv Detail & Related papers (2024-07-30T14:35:31Z)
- Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detecting misaligned actions before execution.
It alerts users for timely correction, preventing adverse outcomes.
InferAct achieves up to 20% improvement in Marco-F1 over baselines on misaligned-action detection.
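A minimal sketch of the pre-execution gating pattern the summary describes: score the agent's proposed action against the user's intent before it runs and surface low-scoring actions to the user. The judge_alignment stand-in below is a placeholder heuristic, not InferAct's actual evaluator.

```python
def judge_alignment(user_intent: str, proposed_action: str) -> float:
    """Placeholder heuristic: probability that the proposed action serves the intent.
    In practice this would be an LLM-based evaluator reasoning over the trajectory."""
    if "delete" in proposed_action and "delete" not in user_intent:
        return 0.4
    return 0.95

def gate_action(user_intent: str, proposed_action: str, threshold: float = 0.8) -> dict:
    """Block potentially misaligned actions before execution and flag them for the user."""
    score = judge_alignment(user_intent, proposed_action)
    if score < threshold:
        return {"execute": False, "reason": f"flagged for user review (score={score:.2f})"}
    return {"execute": True, "reason": "passed pre-execution check"}

# Example: the proposed action does not match the stated intent, so it is held for review.
print(gate_action("tidy up my downloads folder", "delete everything under /"))
```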
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
- Safe Reinforcement Learning via Curriculum Induction [94.67835258431202]
In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly.
Existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations.
This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor.
arXiv Detail & Related papers (2020-06-22T10:48:17Z)