LlamaFirewall: An open source guardrail system for building secure AI agents
- URL: http://arxiv.org/abs/2505.03574v1
- Date: Tue, 06 May 2025 14:34:21 GMT
- Title: LlamaFirewall: An open source guardrail system for building secure AI agents
- Authors: Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe,
- Abstract summary: Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks.<n>Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor.<n>We introduce LlamaFirewall, an open-source security focused guardrail framework.
- Score: 0.5603362829699733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.
Related papers
- BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability [5.483452240835409]
BlockA2A is a unified multi-agent trust framework for agent-to-agent interoperability.<n>It eliminates centralized trust bottlenecks, ensures message authenticity and execution integrity, and guarantees accountability across agent interactions.<n>It neutralizes attacks through real-time mechanisms, including Byzantine agent flagging, reactive execution halting, and instant permission revocation.
arXiv Detail & Related papers (2025-08-02T11:59:21Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents [58.21223208538351]
This work explores the security issues surrounding mobile multimodal agents.<n>It attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information.<n>It also designs an automated assisted assessment scheme based on a large language model.
arXiv Detail & Related papers (2025-07-01T15:10:00Z) - LLM Agents Should Employ Security Principles [60.03651084139836]
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale.<n>We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
arXiv Detail & Related papers (2025-05-29T21:39:08Z) - The Hidden Dangers of Browsing AI Agents [0.0]
This paper presents a comprehensive security evaluation of such agents, focusing on systemic vulnerabilities across multiple architectural layers.<n>Our work outlines the first end-to-end threat model for browsing agents and provides actionable guidance for securing their deployment in real-world environments.
arXiv Detail & Related papers (2025-05-19T13:10:29Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - DoomArena: A framework for Testing AI Agents Against Evolving Security Threats [84.94654617852322]
We present DoomArena, a security evaluation framework for AI agents.<n>It is a plug-in framework and integrates easily into realistic agentic frameworks.<n>It is modular and decouples the development of attacks from details of the environment in which the agent is deployed.
arXiv Detail & Related papers (2025-04-18T20:36:10Z) - Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System [0.8136541584281987]
This work uses three examination methods to detect rogue agents through a Reverse Turing Test and analyze deceptive alignment through multi-agent simulations.<n>We develop an anti-jailbreaking system by testing it with GEMINI 1.5 pro and llama-3.3-70B, deepseek r1 models.<n>The detection capabilities are strong such as 94% accuracy for GEMINI 1.5 pro yet the system suffers persistent vulnerabilities when under long attacks.
arXiv Detail & Related papers (2025-02-23T23:35:15Z) - Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach [9.483655213280738]
This paper presents a novel approach to evaluating the security of large language models (LLMs)<n>We define prompt leakage as a critical threat to secure LLM deployment.<n>We implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt.
arXiv Detail & Related papers (2025-02-18T08:17:32Z) - The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents [6.829628038851487]
Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration.<n>In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions.<n>We propose a novel perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives.
arXiv Detail & Related papers (2024-12-21T16:17:48Z) - AdvAgent: Controllable Blackbox Red-teaming on Web Agents [22.682464365220916]
AdvAgent is a black-box red-teaming framework for attacking web agents.<n>It employs a reinforcement learning-based pipeline to train an adversarial prompter model.<n>With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability.
arXiv Detail & Related papers (2024-10-22T20:18:26Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning [79.07152553060601]
We propose GuardAgent, the first guardrail agent to protect the target agents by dynamically checking whether their actions satisfy given safety guard requests.<n>Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution.<n>We show that GuardAgent effectively moderates the violation actions for different types of agents on two benchmarks with over 98% and 83% guardrail accuracies.
arXiv Detail & Related papers (2024-06-13T14:49:26Z) - Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [47.219047422240145]
We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents.
Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms.
arXiv Detail & Related papers (2024-02-17T06:48:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.