How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs
- URL: http://arxiv.org/abs/2401.06373v2
- Date: Tue, 23 Jan 2024 22:46:12 GMT
- Title: How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs
- Authors: Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
- Abstract summary: This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators.
We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs.
PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials.
On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses.
- Score: 66.05593434288625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most traditional AI safety research has approached AI models as machines and
centered on algorithm-focused attacks developed by security experts. As large
language models (LLMs) become increasingly common and competent, non-expert
users can also impose risks during daily interactions. This paper introduces a
new perspective to jailbreak LLMs as human-like communicators, to explore this
overlooked intersection between everyday language interaction and AI safety.
Specifically, we study how to persuade LLMs to jailbreak them. First, we
propose a persuasion taxonomy derived from decades of social science research.
Then, we apply the taxonomy to automatically generate interpretable persuasive
adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion
significantly increases the jailbreak performance across all risk categories:
PAP consistently achieves an attack success rate of over $92\%$ on Llama 2-7b
Chat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused
attacks. On the defense side, we explore various mechanisms against PAP and,
found a significant gap in existing defenses, and advocate for more fundamental
mitigation for highly interactive LLMs
Related papers
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models [61.916827858666906]
Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
This paper proposes a method called Token Highlighter to inspect and mitigate the potential jailbreak threats in the user query.
arXiv Detail & Related papers (2024-12-24T05:10:02Z) - MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [35.7801861576917]
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities.
LLMs have been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks.
We propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values.
arXiv Detail & Related papers (2024-11-06T10:32:09Z) - LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
Existing jailbreak methods suffer from two main limitations: reliance on complicated prompt engineering and iterative optimization.
We propose an efficient jailbreak attack method, Analyzing-based Jailbreak (ABJ), which leverages the advanced reasoning capability of LLMs to autonomously generate harmful content.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or leveraging complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only the eos tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z) - Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs)
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z) - Self-Deception: Reverse Penetrating the Semantic Firewall of Large
Language Models [13.335189124991082]
This paper investigates the LLM jailbreak problem and proposes an automatic jailbreak method for the first time.
Inspired by the attack that penetrates traditional firewalls through reverse tunnels, we introduce a "self-deception" attack that can bypass the semantic firewall.
We generated a total of 2,520 attack payloads in six languages across seven virtual scenarios, targeting the three most common types of violations: violence, hate, and pornography.
arXiv Detail & Related papers (2023-08-16T09:04:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.