Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- URL: http://arxiv.org/abs/2310.06987v1
- Date: Tue, 10 Oct 2023 20:15:54 GMT
- Title: Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- Authors: Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
- Abstract summary: Even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks".
We propose the generation exploitation attack, which disrupts model alignment by only manipulating variations of decoding methods.
Our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs.
- Score: 39.829517061574364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress in open-source large language models (LLMs) is
significantly advancing AI development. Extensive efforts have been made before
model release to align their behavior with human values, with the primary goal
of ensuring their helpfulness and harmlessness. However, even carefully aligned
models can be manipulated maliciously, leading to unintended behaviors, known
as "jailbreaks". These jailbreaks are typically triggered by specific text
inputs, often referred to as adversarial prompts. In this work, we propose the
generation exploitation attack, an extremely simple approach that disrupts
model alignment by only manipulating variations of decoding methods. By
exploiting different generation strategies, including varying decoding
hyper-parameters and sampling methods, we increase the misalignment rate from
0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon,
and MPT families, outperforming state-of-the-art attacks with $30\times$ lower
computational cost. Finally, we propose an effective alignment method that
explores diverse generation strategies, which can reasonably reduce the
misalignment rate under our attack. Altogether, our study underscores a major
failure in current safety evaluation and alignment procedures for open-source
LLMs, strongly advocating for more comprehensive red teaming and better
alignment before releasing such models. Our code is available at
https://github.com/Princeton-SysML/Jailbreak_LLM.
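As a rough illustration of the attack surface described above, the sketch below sweeps decoding hyper-parameters and sampling settings with Hugging Face transformers; the model name, prompt placeholder, and hyper-parameter grid are assumptions for illustration, not the paper's exact configuration (see the linked repository for the authors' implementation).
```python
# Minimal sketch of the generation exploitation idea: vary decoding
# hyper-parameters and sampling settings instead of crafting adversarial
# prompts, then inspect the resulting outputs.
# Model name, prompt, and the grid below are illustrative placeholders.
from itertools import product

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # a red-teaming instruction under evaluation (placeholder)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sweep temperature, top-p, and top-k (illustrative grid).
temperatures = [0.1, 0.7, 1.5]
top_ps = [0.5, 0.9, 1.0]
top_ks = [10, 50, 500]

candidates = []
for t, p, k in product(temperatures, top_ps, top_ks):
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=t,
        top_p=p,
        top_k=k,
        max_new_tokens=128,
    )
    text = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    candidates.append(((t, p, k), text))

# The paper scores candidates with a harmfulness classifier; here we simply
# collect them for manual inspection.
```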
Related papers
- Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment [16.5939079098358]
In this paper, we study how simple random augmentations to the input prompt affect safety alignment effectiveness in state-of-the-art LLMs.
We show that low-resource and unsophisticated attackers can significantly improve their chances of bypassing alignment with just 25 random augmentations per prompt.
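A minimal sketch of the kind of random character-level augmentation described above, assuming simple insert/replace/delete edits (the paper's exact augmentation types may differ):
```python
import random
import string

def random_augment(prompt: str, n_edits: int = 3) -> str:
    """Apply a few random character-level edits (insert / replace / delete)."""
    chars = list(prompt)
    for _ in range(n_edits):
        if not chars:
            break
        op = random.choice(["insert", "replace", "delete"])
        i = random.randrange(len(chars))
        if op == "insert":
            chars.insert(i, random.choice(string.ascii_letters))
        elif op == "replace":
            chars[i] = random.choice(string.ascii_letters)
        else:
            chars.pop(i)
    return "".join(chars)

# 25 independent random variants per prompt, matching the number quoted above;
# the edit operations themselves are illustrative, not the paper's exact set.
prompt = "..."  # the original (refused) request (placeholder)
variants = [random_augment(prompt) for _ in range(25)]
```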
arXiv Detail & Related papers (2024-11-05T03:51:13Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model includes a constraint on perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
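A perplexity constraint of this kind can be sketched as follows, scoring a candidate jailbreak prompt with a reference language model and rejecting it if it strays too far from natural text; the reference model and threshold are illustrative assumptions, not the paper's setup:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"  # any reference LM; choice and threshold are illustrative
tok = AutoTokenizer.from_pretrained(ref_name)
ref = AutoModelForCausalLM.from_pretrained(ref_name)

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = ref(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def within_threat_model(prompt: str, max_ppl: float = 100.0) -> bool:
    # Reject candidate jailbreaks whose perplexity exceeds the budget.
    return perplexity(prompt) <= max_ppl
```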
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill approach for executing jailbreak attacks on Large Language Models (LLMs).
Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content.
Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
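A hedged sketch of the pre-fill idea: seed the assistant turn with benign content so generation resumes past the refusal decision point; the prompt format and pre-filled prefix below are hypothetical examples, not AdaPPA's templates:
```python
# Sketch of a pre-fill style attack: the assistant turn is seeded with safe
# content so the model continues from an already-committed response position.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

user_request = "..."  # the harmful request under evaluation (placeholder)
safe_prefill = "Sure, here is a general overview of the topic:"  # hypothetical

# Manually build the chat string and append the pre-filled assistant prefix,
# so generation starts mid-answer instead of at the refusal decision point.
chat = f"[INST] {user_request} [/INST] {safe_prefill}"
inputs = tok(chat, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```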
arXiv Detail & Related papers (2024-09-11T00:00:58Z)
- EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models [14.5687457011354]
Large Language Models (LLMs) are increasingly attracting attention in various applications.
There is growing concern that some users attempt to exploit these models for malicious purposes.
We introduce a simple yet significant defense approach called EEG-Defender for LLMs.
arXiv Detail & Related papers (2024-08-21T03:25:31Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
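A minimal sketch of the perturb-and-aggregate idea, assuming a crude refusal-string heuristic in place of a real jailbreak judge and a plain character-swap perturbation:
```python
import random
import string

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")  # crude heuristic

def perturb(prompt: str, swap_frac: float = 0.1) -> str:
    """Randomly swap a fraction of characters (character-level perturbation)."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < swap_frac:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def is_jailbroken(response: str) -> bool:
    return not any(m in response for m in REFUSAL_MARKERS)

def smoothed_response(prompt: str, generate, n_copies: int = 10):
    """`generate` is any callable mapping a prompt string to a model response."""
    responses = [generate(perturb(prompt)) for _ in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority_jailbroken = sum(votes) > len(votes) / 2
    # Return a response consistent with the majority vote, mirroring the
    # aggregation step described above.
    for r, v in zip(responses, votes):
        if v == majority_jailbroken:
            return r
```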
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN automatically generates stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
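The genetic-algorithm component can be pictured as a standard score/select/mutate loop over candidate prompts; the sketch below is schematic, with placeholder fitness and mutation callables rather than AutoDAN's hierarchical operators:
```python
import random

def genetic_prompt_search(seed_prompts, fitness, mutate,
                          generations=20, pop_size=32, elite=8):
    """Schematic GA loop: score candidates, keep the best, mutate to refill.

    `fitness` maps a prompt to a score (e.g., a judge model's attack-success
    estimate) and `mutate` produces a readable variant (e.g., paraphrasing a
    sentence or swapping synonyms) - both are placeholders here.
    """
    population = list(seed_prompts)
    while len(population) < pop_size:
        population.append(mutate(random.choice(seed_prompts)))

    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:elite]                      # selection
        children = [mutate(random.choice(parents))    # variation
                    for _ in range(pop_size - elite)]
        population = parents + children

    return max(population, key=fitness)
```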
arXiv Detail & Related papers (2023-10-03T19:44:37Z)