AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
- URL: http://arxiv.org/abs/2407.12784v1
- Date: Wed, 17 Jul 2024 17:59:47 GMT
- Title: AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
- Authors: Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li,
- Abstract summary: We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
- Score: 73.04652687616286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.
Related papers
- DeCE: Deceptive Cross-Entropy Loss Designed for Defending Backdoor Attacks [26.24490960002264]
We propose a general and effective loss function DeCE (Deceptive Cross-Entropy) to enhance the security of Code Language Models.
Our experiments across various code synthesis datasets, models, and poisoning ratios demonstrate the applicability and effectiveness of DeCE.
arXiv Detail & Related papers (2024-07-12T03:18:38Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z) - InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [3.5248694676821484]
We introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks.
InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools.
We show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time.
arXiv Detail & Related papers (2024-03-05T06:21:45Z) - Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [47.219047422240145]
We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents.
Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms.
arXiv Detail & Related papers (2024-02-17T06:48:45Z) - Evil Geniuses: Delving into the Safety of LLM-based Agents [35.49857256840015]
Large language models (LLMs) have revitalized in large language models (LLMs)
This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level.
arXiv Detail & Related papers (2023-11-20T15:50:09Z) - Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception.
We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception.
We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z) - Demystifying Poisoning Backdoor Attacks from a Statistical Perspective [35.30533879618651]
Backdoor attacks pose a significant security risk due to their stealthy nature and potentially serious consequences.
This paper evaluates the effectiveness of any backdoor attack incorporating a constant trigger.
Our derived understanding applies to both discriminative and generative models.
arXiv Detail & Related papers (2023-10-16T19:35:01Z) - Backdoor Attacks Against Incremental Learners: An Empirical Evaluation
Study [79.33449311057088]
This paper empirically reveals the high vulnerability of 11 typical incremental learners against poisoning-based backdoor attack on 3 learning scenarios.
The defense mechanism based on activation clustering is found to be effective in detecting our trigger pattern to mitigate potential security risks.
arXiv Detail & Related papers (2023-05-28T09:17:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.