Related papers: Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

URL: http://arxiv.org/abs/2602.05066v1
Date: Wed, 04 Feb 2026 21:38:38 GMT
Title: Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
Authors: Jafar Isbarov, Murat Kantarcioglu
Abstract summary: Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
Score: 12.356708678431183
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.

Related papers

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup. We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z)
CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents [60.98294016925157]
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks.
arXiv Detail & Related papers (2026-01-14T23:06:35Z)
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems [0.42970700836450487]
This paper proposes a novel architectural framework aimed at enhancing security and reliability in multi-agent systems (MAS). A central component of this framework is a network of Sentinel Agents, functioning as a distributed security layer. Such agents can potentially oversee inter-agent communications, identify potential threats, enforce privacy and access controls, and maintain comprehensive audit records.
arXiv Detail & Related papers (2025-09-18T13:39:59Z)
Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms [31.01865239234458]
In this paper, we evaluate the robustness of agentic systems against attacks that aim to elicit harmful actions from agents. We propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions.
arXiv Detail & Related papers (2025-08-22T15:53:22Z)
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
Demonstrations of Integrity Attacks in Multi-Agent Systems [7.640342064257848]
Multi-Agent Systems (MAS) could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits.
arXiv Detail & Related papers (2025-06-05T02:44:49Z)
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception. We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception. We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.