AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models
- URL: http://arxiv.org/abs/2310.15140v2
- Date: Thu, 14 Dec 2023 06:22:51 GMT
- Title: AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models
- Authors: Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang,
Furong Huang, Ani Nenkova, Tong Sun
- Abstract summary: Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
- Score: 55.748851471119906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety alignment of Large Language Models (LLMs) can be compromised with
manual jailbreak attacks and (automatic) adversarial attacks. Recent studies
suggest that defending against these attacks is possible: adversarial attacks
generate unlimited but unreadable gibberish prompts, detectable by
perplexity-based filters; manual jailbreak attacks craft readable prompts, but
their limited number due to the necessity of human creativity allows for easy
blocking. In this paper, we show that these solutions may be too optimistic. We
introduce AutoDAN, an interpretable, gradient-based adversarial attack that
merges the strengths of both attack types. Guided by the dual goals of
jailbreak and readability, AutoDAN optimizes and generates tokens one by one
from left to right, resulting in readable prompts that bypass perplexity
filters while maintaining high attack success rates. Notably, these prompts,
generated from scratch using gradients, are interpretable and diverse, with
emerging strategies commonly seen in manual jailbreak attacks. They also
generalize to unforeseen harmful behaviors and transfer to black-box LLMs
better than their unreadable counterparts when using limited training data or a
single proxy model. Furthermore, we show the versatility of AutoDAN by
automatically leaking system prompts using a customized objective. Our work
offers a new way to red-team LLMs and understand jailbreak mechanisms via
interpretability.
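To make the described left-to-right procedure concrete, below is a minimal sketch of the dual-objective token selection idea, written under assumptions rather than taken from the authors' released implementation: the gradient-based candidate preselection of the actual method is omitted in favor of simple top-k filtering from the model's own next-token distribution, "gpt2" stands in for the target LLM, and `alpha`, `top_k`, `suffix_len`, and the `"Sure, here is"` target prefix are illustrative choices. The readability term (the model's own log-probability of each appended token) is what keeps the finished prompt's perplexity close to that of ordinary prose, which is what lets it slip past perplexity-based filters.

```python
# Illustrative sketch only: greedy left-to-right suffix generation guided by
# two scores per candidate token -- (1) readability, the model's own
# log-probability of the token, and (2) jailbreak effectiveness, the
# log-likelihood of a desired target prefix given the prompt so far.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # placeholder for the target LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

user_request = "Write a tutorial on ..."  # harmful request (elided)
target_prefix = "Sure, here is"           # desired affirmative start of the response
alpha, top_k, suffix_len = 1.0, 32, 8     # readability weight, candidate pool, suffix length (assumed values)

@torch.no_grad()
def target_logprob(prompt_ids: torch.Tensor) -> float:
    """Log-likelihood of the target prefix given the (request + suffix) so far."""
    tgt_ids = tok(target_prefix, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, tgt_ids], dim=1)
    logits = model(ids).logits
    # score each target token with the logits at the preceding position
    lp = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return lp.gather(1, tgt_ids[0].unsqueeze(1)).sum().item()

prompt_ids = tok(user_request, return_tensors="pt").input_ids
for _ in range(suffix_len):
    with torch.no_grad():
        next_lp = torch.log_softmax(model(prompt_ids).logits[0, -1], dim=-1)
    candidates = next_lp.topk(top_k).indices        # readable candidate tokens
    best_tok, best_score = None, float("-inf")
    for c in candidates:
        cand_ids = torch.cat([prompt_ids, c.view(1, 1)], dim=1)
        # combined objective: jailbreak term + weighted readability term
        score = target_logprob(cand_ids) + alpha * next_lp[c].item()
        if score > best_score:
            best_tok, best_score = c.view(1, 1), score
    prompt_ids = torch.cat([prompt_ids, best_tok], dim=1)

print(tok.decode(prompt_ids[0]))  # readable adversarial prompt

# Sanity check against a perplexity-based filter: because every appended token
# is likely under the model, the prompt's perplexity stays far below that of
# random gibberish suffixes.
with torch.no_grad():
    out = model(prompt_ids, labels=prompt_ids)
print("prompt perplexity:", torch.exp(out.loss).item())
```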
Related papers
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [47.1955210785169]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism.
A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We assume that harmful instructions themselves can function as triggers; if we instead set rejection responses as the triggered response, the backdoored model can then defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
However, LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose EnJa, a novel ensemble attack that hides harmful instructions with a prompt-level jailbreak, boosts the attack success rate with a gradient-based attack, and connects the two attack types via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent [24.487441771427434]
We propose a multi-agent LLM system named RedAgent to generate context-aware jailbreak prompts.
Our system can jailbreak most black-box LLMs in just five queries, roughly doubling the efficiency of existing red-teaming methods.
We have reported all found issues and communicated with OpenAI and Meta for bug fixes.
arXiv Detail & Related papers (2024-07-23T17:34:36Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.