Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
- URL: http://arxiv.org/abs/2407.20859v1
- Date: Tue, 30 Jul 2024 14:35:31 GMT
- Title: Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
- Authors: Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang,
- Abstract summary: Large language models (LLMs) have experienced significant development and are being deployed in real-world applications.
We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions.
Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios.
- Score: 35.16099878559559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, autonomous agents built on large language models (LLMs) have experienced significant development and are being deployed in real-world applications. These agents can extend the base LLM's capabilities in multiple ways. For example, a well-built agent using GPT-3.5-Turbo as its core can outperform the more advanced GPT-4 model by leveraging external components. More importantly, the usage of tools enables these systems to perform actions in the real world, moving from merely generating text to actively interacting with their environment. Given the agents' practical applications and their ability to execute consequential actions, it is crucial to assess potential vulnerabilities. Such autonomous systems can cause more severe damage than a standalone language model if compromised. While some existing research has explored harmful actions by LLM agents, our study approaches the vulnerability from a different perspective. We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions. We conduct comprehensive evaluations using various attack methods, surfaces, and properties to pinpoint areas of susceptibility. Our experiments reveal that these attacks can induce failure rates exceeding 80\% in multiple scenarios. Through attacks on implemented and deployable agents in multi-agent scenarios, we accentuate the realistic risks associated with these vulnerabilities. To mitigate such attacks, we propose self-examination detection methods. However, our findings indicate these attacks are difficult to detect effectively using LLMs alone, highlighting the substantial risks associated with this vulnerability.
Related papers
- Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In [5.65782619470663]
We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack.
Our experiments show that indirect prompt injection attacks can significantly increase the likelihood of the agent performing subsequent malicious actions.
To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution.
arXiv Detail & Related papers (2024-10-22T12:24:41Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
arXiv Detail & Related papers (2024-07-17T17:59:47Z) - Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities [28.244283407749265]
We investigate the security implications of large language models (LLMs) in multi-agent systems.
We propose a novel two-stage attack method involving Persuasiveness Injection and Manipulated Knowledge Injection.
We demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge.
arXiv Detail & Related papers (2024-07-10T16:08:46Z) - The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative [55.08395463562242]
Multimodal Large Language Models (MLLMs) are constantly defining the new boundary of Artificial General Intelligence (AGI)
Our paper explores a novel vulnerability in MLLM societies - the indirect propagation of malicious content.
arXiv Detail & Related papers (2024-02-20T23:08:21Z) - Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines.
While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety.
This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z) - LLbezpeky: Leveraging Large Language Models for Vulnerability Detection [10.330063887545398]
Large Language Models (LLMs) have shown tremendous potential in understanding semnatics in human as well as programming languages.
We focus on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities.
arXiv Detail & Related papers (2024-01-02T16:14:30Z) - Evil Geniuses: Delving into the Safety of LLM-based Agents [35.49857256840015]
Large language models (LLMs) have revitalized in large language models (LLMs)
This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level.
arXiv Detail & Related papers (2023-11-20T15:50:09Z) - Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models.
Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.