Related papers: BashArena: A Control Setting for Highly Privileged AI Agents

URL: http://arxiv.org/abs/2512.15688v1
Date: Wed, 17 Dec 2025 18:45:25 GMT
Title: BashArena: A Control Setting for Highly Privileged AI Agents
Authors: Adam Kaufman, James Lucassen, Tyler Tracy, Cody Rushing, Aryan Bhatt,
Abstract summary: We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts.
Score: 1.685485565763117
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.

Related papers

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D [4.230181169227057]
We investigate the capabilities of AI agents to act against the interests of their users when conducting machine learning (ML) engineering. We extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. We use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors.
arXiv Detail & Related papers (2025-11-13T03:02:36Z)
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors [18.075392981522565]
We show how attackers can remotely exploit these systems by poisoning external development resources with malicious instructions. We implement AIShellJack, an automated testing framework for assessing prompt injection vulnerabilities in agentic AI coding editors. Our evaluation results show that attack success rates can reach as high as 84% for executing malicious commands.
arXiv Detail & Related papers (2025-09-26T08:20:54Z)
Cuckoo Attack: Stealthy and Persistent Attacks Against AI-IDE [64.47951172662745]
Cuckoo Attack is a novel attack that achieves stealthy and persistent command execution by embedding malicious payloads into configuration files. We formalize our attack paradigm into two stages, including initial infection and persistence. We contribute seven actionable checkpoints for vendors to evaluate their product security.
arXiv Detail & Related papers (2025-09-19T04:10:52Z)
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models. Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z)
STACK: Adversarial Attacks on LLM Safeguard Pipelines [5.784929232265091]
Anthropic guards its latest Claude 4 Opus model using one such defense pipeline. Other frontier developers, including Google DeepMind and OpenAI, pledge to soon deploy similar defenses. We address this gap by developing and red-teaming an open-source defense pipeline.
arXiv Detail & Related papers (2025-06-30T17:21:08Z)
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents [8.02267424051267]
Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena.
arXiv Detail & Related papers (2025-06-17T15:46:15Z)
Raijū: Reinforcement Learning-Guided Post-Exploitation for Automating Security Assessment of Network Systems [0.0]
The Raijū framework is a reinforcement learning-driven automation approach. We implement two RL algorithms to train specialized agents capable of making intelligent actions. The agents succeed in over 84% of attacks using fewer than 55 attack steps.
arXiv Detail & Related papers (2023-09-27T09:36:22Z)
A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023. We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance. This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z)
Automating Privilege Escalation with Deep Reinforcement Learning [71.87228372303453]
In this work, we exemplify the potential threat of malicious actors using deep reinforcement learning to train automated agents. We present an agent that uses a state-of-the-art reinforcement learning algorithm to perform local privilege escalation. Our agent is usable for generating realistic attack sensor data for training and evaluating intrusion detection systems.
arXiv Detail & Related papers (2021-10-04T12:20:46Z)
The Feasibility and Inevitability of Stealth Attacks [63.14766152741211]
We study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence systems. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself.
arXiv Detail & Related papers (2021-06-26T10:50:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.