Related papers: A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

URL: http://arxiv.org/abs/2402.10196v1
Date: Thu, 15 Feb 2024 18:51:32 GMT
Title: A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents
Authors: Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun
Abstract summary: We present the first systematic effort in mapping adversarial attacks against language agents. We propose 12 potential attack scenarios against different components of an agent, covering different attack strategies. We emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment.
Score: 37.978142062138986
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language agents powered by large language models (LLMs) have seen exploding development. Their capability of using language as a vehicle for thought and communication lends an incredible level of flexibility and versatility. People have quickly capitalized on this capability to connect LLMs to a wide range of external components and environments: databases, tools, the Internet, robotic embodiment, etc. Many believe an unprecedentedly powerful automation technology is emerging. However, new automation technologies come with new safety risks, especially for intricate systems like language agents. There is a surprisingly large gap between the speed and scale of their development and deployment and our understanding of their safety risks. Are we building a house of cards? In this position paper, we present the first systematic effort in mapping adversarial attacks against language agents. We first present a unified conceptual framework for agents with three major components: Perception, Brain, and Action. Under this framework, we present a comprehensive discussion and propose 12 potential attack scenarios against different components of an agent, covering different attack strategies (e.g., input manipulation, adversarial demonstrations, jailbreaking, backdoors). We also draw connections to successful attack strategies previously applied to LLMs. We emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment.

Related papers

The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models [3.221349323179165]
Large language models (LLMs) have seen widespread applications across various domains, yet remain vulnerable to adversarial prompt injections.<n>We present a first-of-its-kind integrated adversarial framework that leverages diverse attack techniques to evaluate frontier proprietary solutions.<n>Our evaluation spans six categories of security contents in both English and Chinese, generating 38,400 responses across 32 types of jailbreak attacks.
arXiv Detail & Related papers (2025-05-18T07:51:19Z)
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
Red-Teaming LLM Multi-Agent Systems via Communication Attacks [10.872328358364776]
Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. We introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages.
arXiv Detail & Related papers (2025-02-20T18:55:39Z)
Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs) In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z)
A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluations [4.332547891216645]
We adapt the general taxonomy for classifying machine learning attacks on one of the subdivisions - training-time white-box backdoor attacks. We also consider the corresponding defense methods against backdoor attacks.
arXiv Detail & Related papers (2025-02-06T04:43:05Z)
Jailbreaking Large Language Models in Infinitely Many Ways [3.5674816606221182]
We discuss the Infinitely Many Paraphrases'' attacks (IMP), a category of jailbreaks that leverages the increasing capabilities of a model to handle paraphrases and encoded communications. IMPs' viability pairs and grows with a model's capabilities to handle and bind the semantics of simple mappings between tokens. We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies.
arXiv Detail & Related papers (2025-01-18T15:39:53Z)
SoK: A Systems Perspective on Compound AI Threats and Countermeasures [3.458371054070399]
We discuss different software and hardware attacks applicable to compound AI systems. We show how combining multiple attack mechanisms can reduce the threat model assumptions required for an isolated attack.
arXiv Detail & Related papers (2024-11-20T17:08:38Z)
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [36.44365630876591]
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities. LLMs have been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks. We propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values.
arXiv Detail & Related papers (2024-11-06T10:32:09Z)
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation. Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z)
Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence. This paper uncovers a significant backdoor security threat within this process. By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z)
Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems [27.316115171846953]
Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied AI. LLMs are fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. This fine-tuning process introduces considerable safety and security vulnerabilities, especially in safety-critical cyber-physical systems.
arXiv Detail & Related papers (2024-05-27T17:59:43Z)
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [5.860289498416911]
Large Language Models (LLMs) are swiftly advancing in architecture and capability. As they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs.
arXiv Detail & Related papers (2023-10-16T21:37:24Z)
Agents: An Open-source Framework for Autonomous Language Agents [98.91085725608917]
We consider language agents as a promising direction towards artificial general intelligence. We release Agents, an open-source library with the goal of opening up these advances to a wider non-specialist audience.
arXiv Detail & Related papers (2023-09-14T17:18:25Z)
The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI) We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z)
Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses. We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
Certifiably Robust Policy Learning against Adversarial Communication in Multi-agent Systems [51.6210785955659]
Communication is important in many multi-agent reinforcement learning (MARL) problems for agents to share information and make good decisions. However, when deploying trained communicative agents in a real-world application where noise and potential attackers exist, the safety of communication-based policies becomes a severe issue that is underexplored. In this work, we consider an environment with $N$ agents, where the attacker may arbitrarily change the communication from any $CfracN-12$ agents to a victim agent.
arXiv Detail & Related papers (2022-06-21T07:32:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.