Evil Geniuses: Delving into the Safety of LLM-based Agents
- URL: http://arxiv.org/abs/2311.11855v2
- Date: Fri, 2 Feb 2024 08:28:01 GMT
- Title: Evil Geniuses: Delving into the Safety of LLM-based Agents
- Authors: Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, Hang Su
- Abstract summary: Large language models (LLMs) have revitalized in large language models (LLMs)
This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level.
- Score: 35.49857256840015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid advancements in large language models (LLMs) have revitalized in
LLM-based agents, exhibiting impressive human-like behaviors and cooperative
capabilities in various scenarios. However, these agents also bring some
exclusive risks, stemming from the complexity of interaction environments and
the usability of tools. This paper delves into the safety of LLM-based agents
from three perspectives: agent quantity, role definition, and attack level.
Specifically, we initially propose to employ a template-based attack strategy
on LLM-based agents to find the influence of agent quantity. In addition, to
address interaction environment and role specificity issues, we introduce Evil
Geniuses (EG), an effective attack method that autonomously generates prompts
related to the original role to examine the impact across various role
definitions and attack levels. EG leverages Red-Blue exercises, significantly
improving the generated prompt aggressiveness and similarity to original roles.
Our evaluations on CAMEL, Metagpt and ChatDev based on GPT-3.5 and GPT-4,
demonstrate high success rates. Extensive evaluation and discussion reveal that
these agents are less robust, prone to more harmful behaviors, and capable of
generating stealthier content than LLMs, highlighting significant safety
challenges and guiding future research. Our code is available at
https://github.com/T1aNS1R/Evil-Geniuses.
Related papers
- Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs)
In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents.
We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z) - Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication.
This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification [35.16099878559559]
Large language models (LLMs) have experienced significant development and are being deployed in real-world applications.
We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions.
Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios.
arXiv Detail & Related papers (2024-07-30T14:35:31Z) - AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
arXiv Detail & Related papers (2024-07-17T17:59:47Z) - Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z) - Affordable Generative Agents [16.372072265248192]
Affordable Generative Agents (AGA) is a framework for enabling the generation of believable and low-cost interactions on both agent-environment and inter-agents levels.
Our code is publicly available at: https://github.com/AffordableGenerativeAgents/Affordable-Generative-Agents.
arXiv Detail & Related papers (2024-02-03T06:16:28Z) - R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [28.0550468465181]
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications.
This work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments.
We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records.
arXiv Detail & Related papers (2024-01-18T14:40:46Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.