Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
- URL: http://arxiv.org/abs/2602.05066v1
- Date: Wed, 04 Feb 2026 21:38:38 GMT
- Title: Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
- Authors: Jafar Isbarov, Murat Kantarcioglu,
- Abstract summary: Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent.<n>We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a- Proxy attack.<n>Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
- Score: 12.356708678431183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
Related papers
- OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup.<n>We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z) - CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents [60.98294016925157]
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss.<n>We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content.<n>Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks.
arXiv Detail & Related papers (2026-01-14T23:06:35Z) - Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model.<n>We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z) - Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems [0.42970700836450487]
This paper proposes a novel architectural framework aimed at enhancing security and reliability in multi-agent systems (MAS)<n>A central component of this framework is a network of Sentinel Agents, functioning as a distributed security layer.<n>Such agents can potentially oversee inter-agent communications, identify potential threats, enforce privacy and access controls, and maintain comprehensive audit records.
arXiv Detail & Related papers (2025-09-18T13:39:59Z) - Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms [31.01865239234458]
In this paper, we evaluate the robustness of agentic systems against attacks that aim to elicit harmful actions from agents.<n>We propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS.<n> BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions.
arXiv Detail & Related papers (2025-08-22T15:53:22Z) - BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z) - Demonstrations of Integrity Attacks in Multi-Agent Systems [7.640342064257848]
Multi-Agent Systems (MAS) could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality.<n>This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits.
arXiv Detail & Related papers (2025-06-05T02:44:49Z) - MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions.<n>We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z) - Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception.
We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception.
We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.