CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
- URL: http://arxiv.org/abs/2511.09904v2
- Date: Tue, 18 Nov 2025 06:58:19 GMT
- Title: CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
- Authors: Francis Rhys Ward, Teun van der Weij, Hanna Gábor, Sam Martin, Raja Mehta Moreno, Harel Lidar, Louis Makower, Thomas Jodrell, Lauren Robson,
- Abstract summary: We investigate the capabilities of AI agents to act against the interests of their users when conducting machine learning (ML) engineering. We extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. We use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors.
- Score: 4.230181169227057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable of detecting code-sabotage attempts, but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework and we make our code publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit
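The abstract notes that aggregating multiple monitor predictions works well, but it does not say how the predictions are combined. As a rough illustration only, the sketch below averages several suspicion scores for one agent run and flags the run when the mean crosses a threshold. The function names, the 0-100 score scale, and the threshold of 50 are all assumptions made for this example and are not taken from the paper or its code.

```python
from statistics import mean


def aggregate_suspicion(scores: list[float]) -> float:
    """Average suspicion scores from several independent monitor calls.

    Assumption: each score is on a hypothetical 0-100 scale, where higher
    means the monitor found the agent's behaviour more suspicious.
    """
    if not scores:
        raise ValueError("need at least one monitor score")
    return mean(scores)


def flag_run(scores: list[float], threshold: float = 50.0) -> bool:
    """Flag a run when the averaged score meets a (hypothetical) threshold."""
    return aggregate_suspicion(scores) >= threshold


if __name__ == "__main__":
    # Three hypothetical monitor scores for the same agent trajectory.
    run_scores = [60.0, 70.0, 20.0]
    print(aggregate_suspicion(run_scores), flag_run(run_scores))
```

Averaging is only one possible aggregation rule; taking the maximum or a majority vote would trade false positives against missed detections differently.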
Related papers
- Async Control: Stress-testing Asynchronous Control Measures for LLM Agents [2.7146936326590425]
We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents).
arXiv Detail & Related papers (2025-12-15T16:56:28Z)
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
"Shortcuts" to complete tasks pose significant risks for reliable assessment and deployment of large language models. We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
arXiv Detail & Related papers (2025-10-23T06:58:32Z)
- Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring [3.1817491936340314]
Sandbagging is the strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model's chain-of-thought (CoT) reasoning. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints.
arXiv Detail & Related papers (2025-07-31T15:19:30Z)
- SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents [8.02267424051267]
Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena.
arXiv Detail & Related papers (2025-06-17T15:46:15Z)
- RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? [6.861292004336852]
Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations. This raises an important question: can models learn to evade such monitors? We introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors. We find that token-level monitors are highly vulnerable to this attack while more holistic monitors, such as max-pooling or attention-based probes, remain robust.
arXiv Detail & Related papers (2025-06-17T07:22:20Z)
- Among Us: A Sandbox for Measuring and Detecting Agentic Deception [1.1893676124374688]
We introduce Among Us, a social deception game where language-based agents exhibit long-term, open-ended deception. We find that models trained with RL are comparatively much better at producing deception than detecting it. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less.
arXiv Detail & Related papers (2025-04-05T06:09:32Z)
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations [76.19419888353586]
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations.
We present our efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms.
arXiv Detail & Related papers (2024-03-09T21:07:16Z)