Related papers: Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

URL: http://arxiv.org/abs/2512.13526v1
Date: Mon, 15 Dec 2025 16:56:28 GMT
Title: Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Authors: Asa Cooper Stickland, Jan Michelfeit, Arathi Mani, Charlie Griffin, Ollie Matthews, Tomek Korbak, Rogan Inglis, Oliver Makins, Alan Cooney,
Abstract summary: We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact.<n>Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs.<n>We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents)
Score: 2.7146936326590425
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based software engineering agents are increasingly used in real-world development tasks, often with access to sensitive data or security-critical codebases. Such agents could intentionally sabotage these codebases if they were misaligned. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents). We attempt to set the game rules such that they upper bound the sabotage potential of an agent based on Claude 4.1 Opus. To ground this game in a realistic, high-stakes deployment scenario, we develop a suite of 5 diverse software engineering environments that simulate tasks that an agent might perform within an AI developer's internal infrastructure. Over the course of the game, we develop an ensemble monitor that achieves a 6% false negative rate at 1% false positive rate on a held out test environment. Then, we estimate risk of sabotage at deployment time by extrapolating from our monitor's false negative rate. We describe one simple model for this extrapolation, present a sensitivity analysis, and describe situations in which the model would be invalid. Code is available at: https://github.com/UKGovernmentBEIS/async-control.

Related papers

Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision.<n>We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time.<n>We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z)
If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence [1.1637186977447433]
We show that we can achieve reliability without acquiring perfect components, but through careful orchestration of imperfect ones.<n>This paper describes the architecture of such a system in practice: specialized agent teams (planners, executors, critics, experts)<n>We demonstrate the approach achieves over 90% internal error interception prior to user exposure while maintaining acceptable latency tradeoffs.
arXiv Detail & Related papers (2026-01-20T17:19:09Z)
CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D [4.230181169227057]
We investigate the capabilities of AI agents to act against the interests of their users when conducting machine learning (ML) engineering.<n>We extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures.<n>We use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors.
arXiv Detail & Related papers (2025-11-13T03:02:36Z)
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model.<n>We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents.<n>TraitBasis learns directions in activation space corresponding to steerable user traits.<n>We observe on average a 2%-30% performance degradation on $tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
Reliable Weak-to-Strong Monitoring of LLM Agents [6.922769543581406]
We stress test monitoring systems for detecting covert misbehavior in autonomous agents.<n>We develop a monitor red teaming (MRT) workflow that incorporates varying levels of agent and monitor situational awareness.<n>We release code, data, and logs to spur further research.
arXiv Detail & Related papers (2025-08-26T22:29:31Z)
AgentSight: System-Level Observability for AI Agents Using eBPF [10.37440633887049]
Existing tools observe either an agent's high-level intent (via LLM prompts) or its low-level actions (e.g., system calls) but cannot correlate these two views.<n>We introduce AgentSight, an AgentOps observability framework that bridges this semantic gap using a hybrid approach.<n>AgentSight intercepts TLS-encrypted LLM traffic to extract semantic intent, monitors kernel events to observe system-wide effects, and causally correlates these two streams across process boundaries.
arXiv Detail & Related papers (2025-08-02T01:43:39Z)
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents [52.92354372596197]
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities.<n>This interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior.<n>We propose a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control and data-level constraints.
arXiv Detail & Related papers (2025-06-13T05:01:09Z)
Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of emphgeneralization failure on controllable states from stranger agents. We propose a novel method called Self-Trajectory Augmentation (STA), which will reset the environment to the agent's old states according to the Q function during training.
arXiv Detail & Related papers (2023-04-26T10:12:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.