Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
- URL: http://arxiv.org/abs/2402.11208v1
- Date: Sat, 17 Feb 2024 06:48:45 GMT
- Title: Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
- Authors: Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
- Abstract summary: We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents.
We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis on the different forms of agent backdoor attacks.
We propose the corresponding data poisoning mechanisms to implement the above variations of agent backdoor attacks on two typical agent tasks.
- Score: 50.034049716274005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging the rapid development of Large Language Models (LLMs), LLM-based
agents have been developed to handle various real-world applications, such as
finance, healthcare, and shopping. It is crucial to ensure the reliability
and security of LLM-based agents during applications. However, the safety
issues of LLM-based agents are currently under-explored. In this work, we take
the first step to investigate one of the typical safety threats, backdoor
attack, to LLM-based agents. We first formulate a general framework of agent
backdoor attacks, then we present a thorough analysis on the different forms of
agent backdoor attacks. Specifically, from the perspective of the final
attacking outcomes, the attacker can either choose to manipulate the final
output distribution, or only introduce malicious behavior in the intermediate
reasoning process, while keeping the final output correct. Furthermore, the
former category can be divided into two subcategories based on trigger
locations: the backdoor trigger can be hidden either in the user query or in an
intermediate observation returned by the external environment. We propose the
corresponding data poisoning mechanisms to implement the above variations of
agent backdoor attacks on two typical agent tasks, web shopping and tool
utilization. Extensive experiments show that LLM-based agents suffer severely
from backdoor attacks, indicating an urgent need for further research on the
development of defenses against backdoor attacks on LLM-based agents. Warning:
This paper may contain biased content.
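
The taxonomy in the abstract can be sketched as a small data model. This is only an illustrative reading of the abstract, not the authors' code: the names `Outcome`, `TriggerLocation`, `AttackForm`, `all_forms`, and `describe` are hypothetical, and the sketch assumes that the trigger-location split applies only to attacks that manipulate the final output, as the abstract states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    """What the attacker changes (hypothetical labels)."""
    FINAL_OUTPUT = "final output"
    INTERMEDIATE_ONLY = "intermediate reasoning only"


class TriggerLocation(Enum):
    """Where the backdoor trigger is hidden (hypothetical labels)."""
    USER_QUERY = "user query"
    OBSERVATION = "intermediate observation"


@dataclass(frozen=True)
class AttackForm:
    outcome: Outcome
    # Per the abstract, only final-output attacks are subdivided by location.
    trigger_location: Optional[TriggerLocation] = None


def all_forms() -> List[AttackForm]:
    """Enumerate the three forms described in the abstract."""
    forms = [AttackForm(Outcome.FINAL_OUTPUT, loc) for loc in TriggerLocation]
    forms.append(AttackForm(Outcome.INTERMEDIATE_ONLY))
    return forms


def describe(form: AttackForm) -> str:
    """Return a one-line description of an attack form."""
    if form.outcome is Outcome.FINAL_OUTPUT:
        if form.trigger_location is None:
            raise ValueError("final-output attacks need a trigger location")
        return ("final-output attack, trigger hidden in the "
                + form.trigger_location.value)
    return "intermediate-reasoning attack, final output kept correct"


if __name__ == "__main__":
    for form in all_forms():
        print(describe(form))
```

Running the module prints one line per form, which makes the two-level split (outcome first, then trigger location) explicit.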
Related papers
- AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
arXiv Detail & Related papers (2024-07-17T17:59:47Z)
- GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning [79.07152553060601]
Existing methods for enhancing the safety of large language models (LLMs) are not directly transferable to LLM-powered agents.
We propose GuardAgent, the first LLM agent as a guardrail to other LLM agents.
GuardAgent comprises two steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines.
arXiv Detail & Related papers (2024-06-13T14:49:26Z)
- BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents [26.057916556444333]
We show that such methods are vulnerable to our proposed backdoor attacks named BadAgent.
Our proposed attack methods are extremely robust even after fine-tuning on trustworthy data.
arXiv Detail & Related papers (2024-06-05T07:14:28Z)
- Backdoor Removal for Generative Large Language Models [42.19147076519423]
Generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks, from understanding to reasoning.
A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data.
We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z)
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [3.5248694676821484]
We introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to indirect prompt injection (IPI) attacks.
InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools.
We show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time.
arXiv Detail & Related papers (2024-03-05T06:21:45Z)
- The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative [55.08395463562242]
Multimodal Large Language Models (MLLMs) are constantly defining the new boundary of Artificial General Intelligence (AGI).
Our paper explores a novel vulnerability in MLLM societies - the indirect propagation of malicious content.
arXiv Detail & Related papers (2024-02-20T23:08:21Z)
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution [48.84353890821038]
This paper presents an Agent-Constitution-based agent framework, TrustAgent, an initial investigation into improving the safety dimension of trustworthiness in LLM-based agents.
We demonstrate how pre-planning strategy injects safety knowledge to the model prior to plan generation, in-planning strategy bolsters safety during plan generation, and post-planning strategy ensures safety by post-planning inspection.
We explore the intricate relationships between safety and helpfulness, and between the model's reasoning ability and its efficacy as a safe agent.
arXiv Detail & Related papers (2024-02-02T17:26:23Z)
- Evil Geniuses: Delving into the Safety of LLM-based Agents [35.49857256840015]
Large language models (LLMs) have revitalized interest in LLM-based agents.
This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level.
arXiv Detail & Related papers (2023-11-20T15:50:09Z)
- Backdoor Learning: A Survey [75.59571756777342]
Backdoor attacks intend to embed hidden backdoors into deep neural networks (DNNs).
Backdoor learning is an emerging and rapidly growing research area.
This paper presents the first comprehensive survey of this realm.
arXiv Detail & Related papers (2020-07-17T04:09:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.