Weak-to-Strong Jailbreaking on Large Language Models
- URL: http://arxiv.org/abs/2401.17256v2
- Date: Mon, 5 Feb 2024 18:19:46 GMT
- Title: Weak-to-Strong Jailbreaking on Large Language Models
- Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
- Abstract summary: Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
- Score: 96.50953637783581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are vulnerable to jailbreak attacks, which
result in harmful, unethical, or biased text generations. However, existing
jailbreaking methods are computationally costly. In this paper, we propose the
weak-to-strong jailbreaking attack, an efficient method to attack aligned LLMs
to produce harmful text. Our key intuition is based on the observation that
jailbroken and aligned models only differ in their initial decoding
distributions. The weak-to-strong attack's key technical insight is using two
smaller models (a safe and an unsafe one) to adversarially modify a
significantly larger safe model's decoding probabilities. We evaluate the
weak-to-strong attack on 5 diverse LLMs from 3 organizations. The results show
our method can increase the misalignment rate to over 99% on two datasets with
just one forward pass per example. Our study exposes an urgent safety issue
that needs to be addressed when aligning LLMs. As an initial attempt, we
propose a defense strategy to protect against such attacks, but creating more
advanced defenses remains challenging. The code for replicating the method is
available at https://github.com/XuandongZhao/weak-to-strong
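The decoding-time recombination described in the abstract can be sketched in a few lines. The following is a minimal illustration, assuming all three models share a tokenizer and that the strong model's next-token log-probabilities are tilted by an amplified log-ratio between the unsafe and safe weak models; the exact combination rule and the amplification factor `alpha` are assumptions based on the abstract, so consult the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_logits(strong_logits: torch.Tensor,
                          weak_safe_logits: torch.Tensor,
                          weak_unsafe_logits: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Tilt the strong model's next-token distribution toward the weak
    unsafe model and away from the weak safe model (assumed rule):
        log p~(x) ∝ log p_strong(x) + alpha * (log p_unsafe(x) - log p_safe(x))
    All tensors hold next-token logits over a shared vocabulary.
    """
    log_strong = F.log_softmax(strong_logits, dim=-1)
    log_safe = F.log_softmax(weak_safe_logits, dim=-1)
    log_unsafe = F.log_softmax(weak_unsafe_logits, dim=-1)
    # Renormalize the combined scores into a valid log-distribution.
    return F.log_softmax(log_strong + alpha * (log_unsafe - log_safe), dim=-1)

# Toy usage with random logits over an 8-token vocabulary:
torch.manual_seed(0)
s, ws, wu = torch.randn(8), torch.randn(8), torch.randn(8)
next_token = torch.argmax(weak_to_strong_logits(s, ws, wu, alpha=1.5)).item()
```

Because the two weak models are only queried for their next-token distributions, the attack needs just one forward pass per model per step, which is what makes it cheap relative to prompt-search attacks.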
Related papers
- Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only eos (end-of-sequence) tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
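As a hedged illustration of the entry above: appending end-of-sequence tokens to a prompt is mechanically trivial. The eos string and repetition count below are illustrative assumptions, not the paper's tuned recipe.

```python
def boost_prompt(prompt: str, eos_token: str = "</s>", n_eos: int = 5) -> str:
    """Suffix a prompt with repeated eos tokens (the BOOST idea, as
    summarized above). The right eos string and count depend on the
    target model's tokenizer."""
    return prompt + eos_token * n_eos
```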
arXiv Detail & Related papers (2024-05-31T07:41:03Z)
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [38.25697806663553]
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks.
We achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4, and R2D2 from HarmBench.
arXiv Detail & Related papers (2024-04-02T17:58:27Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
Adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
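The perplexity-based filters this entry mentions are straightforward to sketch. A minimal version, assuming a Hugging Face causal LM and an illustrative (not paper-specified) threshold:

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity under a causal LM: exp of the mean next-token loss."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return torch.exp(model(ids, labels=ids).loss).item()

def passes_filter(model, tokenizer, prompt: str, threshold: float = 1000.0) -> bool:
    """Reject prompts whose perplexity exceeds the (assumed) threshold;
    gibberish adversarial suffixes tend to score far above natural text."""
    return perplexity(model, tokenizer, prompt) < threshold
```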
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [39.829517061574364]
Even carefully aligned models can be manipulated maliciously, leading to unintended behaviors known as "jailbreaks".
We propose the generation exploitation attack, which disrupts model alignment merely by varying decoding methods and their hyperparameters.
Our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs.
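A rough sketch of this entry's idea: sweep decoding hyperparameters until a judge flags a completion. The parameter grid and the `judge_is_harmful` callback are illustrative assumptions, not the paper's procedure.

```python
from itertools import product

def sweep_decoding(model, tokenizer, prompt: str, judge_is_harmful):
    """Vary sampling settings on a Hugging Face causal LM and return the
    first completion the judge flags, mirroring the entry's description."""
    inputs = tokenizer(prompt, return_tensors="pt")
    for temperature, top_p in product((0.7, 1.0, 1.5), (0.7, 0.9, 1.0)):
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             top_p=top_p, max_new_tokens=128)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        if judge_is_harmful(text):
            return text, {"temperature": temperature, "top_p": top_p}
    return None, None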
arXiv Detail & Related papers (2023-10-10T20:15:54Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
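A minimal sketch of the mechanism described above, with the perturbation rate, copy count, and per-copy jailbreak check (`is_jailbroken`) as illustrative assumptions:

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters, exploiting the brittleness
    of adversarial prompts to character-level changes."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters + string.digits + " ")
    return "".join(chars)

def smoothllm_flags(prompt: str, query_model, is_jailbroken,
                    n_copies: int = 10) -> bool:
    """Query the model on perturbed copies of the prompt and aggregate
    the per-copy checks by majority vote."""
    votes = [is_jailbroken(query_model(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2
```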
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)