Self-Deception: Reverse Penetrating the Semantic Firewall of Large
Language Models
- URL: http://arxiv.org/abs/2308.11521v2
- Date: Fri, 25 Aug 2023 00:25:06 GMT
- Title: Self-Deception: Reverse Penetrating the Semantic Firewall of Large
Language Models
- Authors: Zhenhua Wang, Wei Xie, Kai Chen, Baosheng Wang, Zhiwen Gui, Enze Wang
- Abstract summary: This paper investigates the LLM jailbreak problem and proposes an automatic jailbreak method for the first time.
Inspired by the attack that penetrates traditional firewalls through reverse tunnels, we introduce a "self-deception" attack that can bypass the semantic firewall.
We generated a total of 2,520 attack payloads in six languages across seven virtual scenarios, targeting the three most common types of violations: violence, hate, and pornography.
- Score: 13.335189124991082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs), such as ChatGPT, have emerged with astonishing
capabilities approaching artificial general intelligence. While providing
convenience for various societal needs, LLMs have also lowered the cost of
generating harmful content. Consequently, LLM developers have deployed
semantic-level defenses to recognize and reject prompts that may lead to
inappropriate content. Unfortunately, these defenses are not foolproof, and
some attackers have crafted "jailbreak" prompts that temporarily hypnotize the
LLM into forgetting content defense rules and answering any improper questions.
To date, there is no clear explanation of the principles behind these
semantic-level attacks and defenses in both industry and academia.
This paper investigates the LLM jailbreak problem and proposes an automatic
jailbreak method for the first time. We propose the concept of a semantic
firewall and provide three technical implementation approaches. Inspired by the
attack that penetrates traditional firewalls through reverse tunnels, we
introduce a "self-deception" attack that can bypass the semantic firewall by
inducing LLM to generate prompts that facilitate jailbreak. We generated a
total of 2,520 attack payloads in six languages (English, Russian, French,
Spanish, Chinese, and Arabic) across seven virtual scenarios, targeting the
three most common types of violations: violence, hate, and pornography. The
experiment was conducted on two models, namely the GPT-3.5-Turbo and GPT-4. The
success rates on the two models were 86.2% and 67%, while the failure rates
were 4.7% and 2.2%, respectively. This highlighted the effectiveness of the
proposed attack method. All experimental code and raw data will be released as
open-source to inspire future research. We believe that manipulating AI
behavior through carefully crafted prompts will become an important research
direction in the future.
Related papers
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Playing Language Game with LLMs Leads to Jailbreaking [18.63358696510664]
We introduce two novel jailbreak methods based on mismatched language games and custom language games.
We demonstrate the effectiveness of our methods, achieving success rates of 93% on GPT-4o, 89% on GPT-4o-mini and 83% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-11-16T13:07:13Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs [33.87649859430635]
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks.
In this paper, we introduce a novel jailbreaking attack framework called PAPILLON.
It is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs.
arXiv Detail & Related papers (2024-09-23T10:03:09Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs)
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z) - DeepInception: Hypnotize Large Language Model to Be Jailbreaker [70.34096187718941]
Large language models (LLMs) have succeeded significantly in various applications but remain susceptible to adversarial jailbreaks.
We present a method to take advantage of the LLMs' personification capabilities to construct $textita virtual, nested scene.
Empirically, the contents induced by our approach can achieve leading harmfulness rates with previous counterparts.
arXiv Detail & Related papers (2023-11-06T15:29:30Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.