Related papers: Multi-Agent Systems Execute Arbitrary Malicious Code

Related papers

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup.<n>We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z)
BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents [58.83028403414688]
Large language model (LLM) agents execute tasks through multi-step workflow that combine planning, memory, and tool use.<n>Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs.<n>We propose textbfBackdoorAgent, a modular and stage-aware framework that provides a unified agent-centric view of backdoor threats in LLM agents.
arXiv Detail & Related papers (2026-01-08T03:49:39Z)
Securing the Model Context Protocol (MCP): Risks, Controls, and Governance [1.4072883206858737]
We focus on three types of adversaries that take advantage of MCP s flexibility.<n>Based on early incidents and proof-of-concept attacks, we describe how MCP can increase the attack surface.<n>We propose a set of practical controls, including per-user authentication with scoped authorization.
arXiv Detail & Related papers (2025-11-25T23:24:26Z)
Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting [5.544819942438653]
We present SafeAgents, a framework for fine-grained security assessment of multi-agent systems.<n>We conduct a study across five widely adopted multi-agent architectures.<n>Our findings reveal that common design patterns carry significant vulnerabilities.
arXiv Detail & Related papers (2025-11-14T04:22:49Z)
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents [70.24175620901538]
Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters.<n>Current static safety benchmarks and red-teaming tools are inadequate for identifying emerging real-world risky scenarios.<n>We propose RedCodeAgent, the first automated red-teaming agent designed to systematically uncover vulnerabilities in diverse code agents.
arXiv Detail & Related papers (2025-10-02T22:59:06Z)
A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks [1.1435139523855764]
This paper presents a novel multi-agent defense framework to detect and neutralize prompt injection attacks in real-time.<n>We evaluate our approach using two distinct architectures: a sequential chain-of-agents pipeline and a hierarchical coordinator-based system.
arXiv Detail & Related papers (2025-09-16T19:11:28Z)
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability [5.483452240835409]
BlockA2A is a unified multi-agent trust framework for agent-to-agent interoperability.<n>It eliminates centralized trust bottlenecks, ensures message authenticity and execution integrity, and guarantees accountability across agent interactions.<n>It neutralizes attacks through real-time mechanisms, including Byzantine agent flagging, reactive execution halting, and instant permission revocation.
arXiv Detail & Related papers (2025-08-02T11:59:21Z)
Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree [8.511846002129522]
We show that adversaries can embed universal adversarial triggers in webpage HTML to hijack agent behavior.<n>Our system demonstrates high success rates across real websites in both targeted and general attacks.
arXiv Detail & Related papers (2025-07-20T03:10:13Z)
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents [58.21223208538351]
This work explores the security issues surrounding mobile multimodal agents.<n>It attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information.<n>It also designs an automated assisted assessment scheme based on a large language model.
arXiv Detail & Related papers (2025-07-01T15:10:00Z)
Context manipulation attacks : Web agents are susceptible to corrupted memory [37.66661108936654]
"Plan injection" is a novel context manipulation attack that corrupts these agents' internal task representations by targeting this vulnerable context.<n>We show that plan injections bypass robust prompt injection defenses, achieving up to 3x higher attack success rates than comparable prompt-based attacks.<n>Our findings highlight that secure memory handling must be a first-class concern in agentic systems.
arXiv Detail & Related papers (2025-06-18T14:29:02Z)
Demonstrations of Integrity Attacks in Multi-Agent Systems [7.640342064257848]
Multi-Agent Systems (MAS) could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality.<n>This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits.
arXiv Detail & Related papers (2025-06-05T02:44:49Z)
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
LlamaFirewall: An open source guardrail system for building secure AI agents [0.5603362829699733]
Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks.<n>Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor.<n>We introduce LlamaFirewall, an open-source security focused guardrail framework.
arXiv Detail & Related papers (2025-05-06T14:34:21Z)
SAGA: A Security Architecture for Governing AI Agentic Systems [13.106925341037046]
Large Language Model (LLM)-based agents increasingly interact, collaborate, and delegate tasks to one another autonomously with minimal human interaction. Industry guidelines for agentic system governance emphasize the need for users to maintain comprehensive control over their agents. We propose SAGA, a Security Architecture for Governing Agentic systems, that offers user oversight over their agents' lifecycle.
arXiv Detail & Related papers (2025-04-27T23:10:00Z)
Manipulating Multimodal Agents via Cross-Modal Prompt Injection [34.35145839873915]
We identify a critical yet previously overlooked security vulnerability in multimodal agents. We propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities. Our method outperforms existing injection attacks, achieving at least a +26.4% increase in attack success rates.
arXiv Detail & Related papers (2025-04-19T16:28:03Z)
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats [84.94654617852322]
We present DoomArena, a security evaluation framework for AI agents. It is a plug-in framework and integrates easily into realistic agentic frameworks. It is modular and decouples the development of attacks from details of the environment in which the agent is deployed.
arXiv Detail & Related papers (2025-04-18T20:36:10Z)
Progent: Programmable Privilege Control for LLM Agents [46.49787947705293]
We introduce Progent, the first privilege control mechanism for LLM agents. At its core is a domain-specific language for flexibly expressing privilege control policies applied during agent execution. This enables agent developers and users to craft suitable policies for their specific use cases and enforce them deterministically to guarantee security.
arXiv Detail & Related papers (2025-04-16T01:58:40Z)
Defeating Prompt Injections by Design [79.00910871948787]
CaMeL is a robust defense that creates a protective system layer around the Large Language Models (LLMs) To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query. We demonstrate effectiveness of CaMeL by solving $67%$ of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.
arXiv Detail & Related papers (2025-03-24T15:54:10Z)
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents [60.92962583528122]
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer.<n>We uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs)<n>MIPs adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs.
arXiv Detail & Related papers (2025-03-13T18:59:12Z)
Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach [9.483655213280738]
This paper presents a novel approach to evaluating the security of large language models (LLMs)<n>We define prompt leakage as a critical threat to secure LLM deployment.<n>We implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt.
arXiv Detail & Related papers (2025-02-18T08:17:32Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored. We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse. We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems [6.480532634073257]
We introduce Prompt Infection, a novel attack where malicious prompts self-replicate across interconnected agents. This attack poses severe threats, including data theft, scams, misinformation, and system-wide disruption. To address this, we propose LLM Tagging, a defense mechanism that, when combined with existing safeguards, significantly mitigates infection spread.
arXiv Detail & Related papers (2024-10-09T11:01:29Z)
On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents.<n>However, the impact of clumsy or even malicious agents, on the overall performance of the system remains underexplored.<n>This paper investigates what is the resilience of various system structures under faulty agents.
arXiv Detail & Related papers (2024-08-02T03:25:20Z)
Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [3.5248694676821484]
We introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time.
arXiv Detail & Related papers (2024-03-05T06:21:45Z)
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications. We show how attackers can override original instructions and employed controls using Prompt Injection attacks. We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.