SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- URL: http://arxiv.org/abs/2506.15740v2
- Date: Tue, 08 Jul 2025 21:23:22 GMT
- Title: SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- Authors: Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton,
- Abstract summary: Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings.<n>We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks.<n>We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena.
- Score: 8.02267424051267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid detection, we find that the best performing frontier models score 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) as sabotage agents when overseen by Claude 3.6 Sonnet. For current frontier models, success on the side task relies heavily on having access to a hidden scratchpad that is not visible to the monitor. We also use SHADE-Arena to measure models' monitoring abilities, with the top monitor (Gemini 2.5 Pro) achieving an AUC of 0.87 at distinguishing benign and malign transcripts. We find that for now, models still struggle at sabotage due to failures in long-context main task execution. However, our measurements already demonstrate the difficulty of monitoring for subtle sabotage attempts, which we expect to only increase in the face of more complex and longer-horizon tasks.
Related papers
- AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification [25.817251923574286]
We propose a novel inference-time detection and mitigation framework for large language model (LLM) agents.<n>AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover.<n>We evaluate AgentSentry on the textscAgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs.
arXiv Detail & Related papers (2026-02-26T07:59:10Z) - Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by problems from real world.<n>We show that frontier models and agents score less than 65% on the benchmark.<n>We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z) - Are Your Agents Upward Deceivers? [73.1073084327614]
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users.<n>This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment.<n>We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting.
arXiv Detail & Related papers (2025-12-04T14:47:05Z) - CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D [4.230181169227057]
We investigate the capabilities of AI agents to act against the interests of their users when conducting machine learning (ML) engineering.<n>We extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures.<n>We use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors.
arXiv Detail & Related papers (2025-11-13T03:02:36Z) - UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios [63.67884284105684]
We introduce textbfUltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges.<n>Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules.<n>Our experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores.
arXiv Detail & Related papers (2025-09-26T02:04:00Z) - AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? [71.21547572568655]
AgenTracer-8B is a lightweight failure tracer trained with multi-granular reinforcement learning.<n>On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%.<n>AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains.
arXiv Detail & Related papers (2025-09-03T13:42:14Z) - Reliable Weak-to-Strong Monitoring of LLM Agents [6.922769543581406]
We stress test monitoring systems for detecting covert misbehavior in autonomous agents.<n>We develop a monitor red teaming (MRT) workflow that incorporates varying levels of agent and monitor situational awareness.<n>We release code, data, and logs to spur further research.
arXiv Detail & Related papers (2025-08-26T22:29:31Z) - LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring [5.214050557192032]
sandbagging is the strategic underperformance on evaluations by AI models or their developers.<n>One promising defense is to monitor a model's chain-of-thought (CoT) reasoning.<n>We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints.
arXiv Detail & Related papers (2025-07-31T15:19:30Z) - Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors [27.976136688947093]
Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals.<n>We propose adding an external monitor that observes the conversation at a higher granularity.<n>We show that a carefully prompt engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor.
arXiv Detail & Related papers (2025-06-12T17:50:58Z) - Why Do Multi-Agent LLM Systems Fail? [91.39266556855513]
We present MAST (Multi-Agent System Failure taxonomy), the first empirically grounded taxonomy designed to understand MAS failures.<n>We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators.<n>We identify 14 unique failure modes, organized into 3 overarching categories, (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.
arXiv Detail & Related papers (2025-03-17T19:04:38Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - Preventing Rogue Agents Improves Multi-Agent Collaboration [21.955058255432974]
Multi-agent systems, where specialized agents collaborate to solve a shared task hold great potential.<n>A single agent can cause the entire system to fail.<n>In this work, we propose to $textitmonitor$ agents during action prediction and $textitintervene$ when a future error is likely to occur.
arXiv Detail & Related papers (2025-02-09T18:35:08Z) - Seamless Detection: Unifying Salient Object Detection and Camouflaged Object Detection [73.85890512959861]
We propose a task-agnostic framework to unify Salient Object Detection (SOD) and Camouflaged Object Detection (COD)<n>We design a simple yet effective contextual decoder involving the interval-layer and global context, which achieves an inference speed of 67 fps.<n> Experiments on public SOD and COD datasets demonstrate the superiority of our proposed framework in both supervised and unsupervised settings.
arXiv Detail & Related papers (2024-12-22T03:25:43Z) - Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection.<n>To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities.<n>Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z) - PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks [57.89516354418451]
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR)
We employ a semi-automated task generation pipeline using Large Language Models (LLMs)
We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
arXiv Detail & Related papers (2024-10-31T17:53:12Z) - AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems [43.333567687032904]
AgentMonitor is a framework that integrates at the agent level to capture inputs and outputs, transforming them into statistics for training a regression model to predict task performance.
It can further apply real-time corrections to address security risks posed by malicious agents, mitigating negative impacts and enhancing MAS security.
arXiv Detail & Related papers (2024-08-27T11:24:38Z) - A Zero-Shot Language Agent for Computer Control with Structured
Reflection [19.526676887048662]
Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment.
To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting.
We approach this problem with a zero-shot agent that requires no given expert traces.
arXiv Detail & Related papers (2023-10-12T21:53:37Z) - Plan, Eliminate, and Track -- Language Models are Good Teachers for
Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.