Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
- URL: http://arxiv.org/abs/2507.20526v1
- Date: Mon, 28 Jul 2025 05:13:04 GMT
- Title: Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
- Authors: Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson,
- Abstract summary: We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios.<n>We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models.<n>Our findings highlight critical and persistent vulnerabilities in today's AI agents.
- Score: 101.86739402748995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark - a curated set of high-impact attacks - and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today's AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.
Related papers
- OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup.<n>We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z) - When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent [20.98129117390391]
We present the first systematic study of social engineering attacks against web automation agents.<n>We introduce the AgentBait paradigm, which exploits intrinsic weaknesses in agent execution.<n>We propose SUPERVISOR, a lightweight runtime module that enforces environment intention and consistency alignment.
arXiv Detail & Related papers (2026-01-12T07:10:08Z) - Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing [83.48116811975787]
We present the first comprehensive evaluation of AI agents against human cybersecurity professionals.<n>We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold.<n>ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate.
arXiv Detail & Related papers (2025-12-10T18:12:29Z) - Evaluating Control Protocols for Untrusted AI Agents [8.75017725864459]
We systematically evaluate control protocols in SHADE-Arena, a dataset of diverse agentic environments.<n>We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%.
arXiv Detail & Related papers (2025-11-04T21:04:49Z) - When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets [0.3069921776214295]
We present CAIA, a benchmark exposing a critical blind spot in AI evaluation.<n>We evaluate 17 models on 178 time-anchored tasks requiring agents to distinguish truth from manipulation.<n>Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks junior analysts routinely handle.
arXiv Detail & Related papers (2025-09-30T22:39:06Z) - Securing AI Agents: Implementing Role-Based Access Control for Industrial Applications [0.0]
In industrial settings, AI agents are transforming operations by enhancing decision-making, predictive maintenance, and process optimization.<n>Despite these advancements, AI agents remain vulnerable to security threats, including prompt injection attacks.<n>This paper proposes a framework for integrating Role-Based Access Control (RBAC) into AI agents, providing a robust security guardrail.
arXiv Detail & Related papers (2025-09-14T20:58:08Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Secure mmWave Beamforming with Proactive-ISAC Defense Against Beam-Stealing Attacks [6.81194385663614]
Millimeter-wave (mmWave) communication systems face increasing susceptibility to advanced beam-stealing attacks.<n>This paper introduces a novel framework employing an advanced Deep Reinforcement Learning (DRL) agent for proactive and adaptive defense.
arXiv Detail & Related papers (2025-08-04T19:30:09Z) - Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems [0.0]
Evolving AI systems increasingly deploy multi-agent architectures where autonomous agents collaborate, share information, and delegate tasks through developing protocols.<n>One such risk is a cascading risk: a breach in one agent can cascade through the system, compromising others by exploiting inter-agent trust.<n>In an ACI attack, a malicious input or tool exploit injected at one agent leads to cascading compromises and amplified downstream effects across agents that trust its outputs.
arXiv Detail & Related papers (2025-07-23T13:51:28Z) - The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover [0.18472148461613155]
This paper presents the first comprehensive evaluation of Large Language Model (LLM) agents as attack vectors.<n>We show that adversaries can leverage three distinct attack surfaces - direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation.<n>Our findings demonstrate that only 5.9% of tested models proved resistant to all attack vectors.
arXiv Detail & Related papers (2025-07-09T13:54:58Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - Dynamic Risk Assessments for Offensive Cybersecurity Agents [8.009741580969873]
We argue that assessments should take into account the varying degrees of freedom that an adversary may possess.<n>We show that adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40% relative to the baseline.
arXiv Detail & Related papers (2025-05-23T21:18:59Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks [36.97842000562324]
We introduce WASP -- a new benchmark for end-to-end evaluation of Web Agent Security against Prompt injection attacks.<n>We show that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections.<n>Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of the case, even state-of-the-art agents often struggle to fully complete the attacker goals.
arXiv Detail & Related papers (2025-04-22T17:51:03Z) - AdvAgent: Controllable Blackbox Red-teaming on Web Agents [22.682464365220916]
AdvAgent is a black-box red-teaming framework for attacking web agents.<n>It employs a reinforcement learning-based pipeline to train an adversarial prompter model.<n>With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability.
arXiv Detail & Related papers (2024-10-22T20:18:26Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.