Evaluating Control Protocols for Untrusted AI Agents
- URL: http://arxiv.org/abs/2511.02997v1
- Date: Tue, 04 Nov 2025 21:04:49 GMT
- Title: Evaluating Control Protocols for Untrusted AI Agents
- Authors: Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton,
- Abstract summary: We systematically evaluate control protocols in SHADE-Arena, a dataset of diverse agentic environments.<n>We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%.
- Score: 8.75017725864459
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.
Related papers
- Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5 [61.787178868669265]
This technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication.<n>This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
arXiv Detail & Related papers (2026-02-16T04:30:06Z) - Aegis: Towards Governance, Integrity, and Security of AI Voice Agents [52.7512082818639]
We propose Aegis, a framework for the governance, integrity, and security of voice agents.<n>We evaluate the framework through case studies in banking call centers, IT Support, and logistics.<n>We observe systematic differences across model families, with open-weight models exhibiting higher susceptibility.
arXiv Detail & Related papers (2026-02-07T05:51:36Z) - Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model.<n>We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z) - Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach [56.34189898996741]
We propose a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent.<n>IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks.
arXiv Detail & Related papers (2025-10-10T06:21:36Z) - AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning [78.5751183537704]
AdvEvo-MARL is a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents.<n>Rather than relying on external guards, AdvEvo-MARL jointly optimize attackers and defenders.
arXiv Detail & Related papers (2025-10-02T02:06:30Z) - Secure mmWave Beamforming with Proactive-ISAC Defense Against Beam-Stealing Attacks [6.81194385663614]
Millimeter-wave (mmWave) communication systems face increasing susceptibility to advanced beam-stealing attacks.<n>This paper introduces a novel framework employing an advanced Deep Reinforcement Learning (DRL) agent for proactive and adaptive defense.
arXiv Detail & Related papers (2025-08-04T19:30:09Z) - Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios.<n>We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models.<n>Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z) - Effective Red-Teaming of Policy-Adherent Agents [10.522087614181745]
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules.<n>We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit.<n>We present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario.
arXiv Detail & Related papers (2025-06-11T10:59:47Z) - Ctrl-Z: Controlling AI Agents via Resampling [2.505902477669827]
We construct BashBench, a dataset of 257 challenging multi-step system administration tasks.<n>We evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code.
arXiv Detail & Related papers (2025-04-14T16:22:11Z) - How to evaluate control measures for LLM agents? A trajectory from today to superintelligence [4.027240141373031]
We propose a framework for adapting affordances of red teams to advancing AI capabilities.<n>We show how knowledge of an agents's actual capability profile can inform proportional control evaluations.
arXiv Detail & Related papers (2025-04-07T16:52:52Z) - A Zero Trust Framework for Realization and Defense Against Generative AI
Attacks in Power Grid [62.91192307098067]
This paper proposes a novel zero trust framework for a power grid supply chain (PGSC)
It facilitates early detection of potential GenAI-driven attack vectors, assessment of tail risk-based stability measures, and mitigation of such threats.
Experimental results show that the proposed zero trust framework achieves an accuracy of 95.7% on attack vector generation, a risk measure of 9.61% for a 95% stable PGSC, and a 99% confidence in defense against GenAI-driven attack.
arXiv Detail & Related papers (2024-03-11T02:47:21Z) - Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z) - Robust Reinforcement Learning on State Observations with Learned Optimal
Adversary [86.0846119254031]
We study the robustness of reinforcement learning with adversarially perturbed state observations.
With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found.
For DRL settings, this leads to a novel empirical adversarial attack to RL agents via a learned adversary that is much stronger than previous ones.
arXiv Detail & Related papers (2021-01-21T05:38:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.