From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
- URL: http://arxiv.org/abs/2407.02855v3
- Date: Tue, 20 May 2025 05:38:08 GMT
- Title: From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
- Authors: Zhexin Zhang, Junxiao Yang, Yida Lu, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang
- Abstract summary: Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks. LLMs can implicitly unlearn harmful knowledge that was not explicitly introduced during the unlearning phase. We empirically validate this phenomenon, which enables unlearning-based methods to decrease the Attack Success Rate.
- Score: 85.84979847888157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Consequently, unlearning-based approaches have been proposed to mitigate jailbreak attacks by directly removing harmful knowledge from the model. In this paper, we identify a novel ripple effect of unlearning, wherein LLMs can implicitly unlearn harmful knowledge that was not explicitly introduced during the unlearning phase (e.g., a model unlearning the steps for theft may also implicitly unlearn the steps for making a bomb). Through over 100 experimental runs spanning multiple models, attack strategies, and defense methods, we empirically validate this phenomenon, which enables unlearning-based methods to decrease the Attack Success Rate on unseen data from more than 70% to less than 10% with only 100 training samples. Further analysis reveals that the strong generalization ability of unlearning may stem from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions in response, and similarity among their learned representations in the LLM). We also discuss the potential limitations of unlearning and the observed ripple effect. We hope our research could contribute to a deeper understanding of unlearning. Our code is available at https://github.com/thu-coai/SafeUnlearning.
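The abstract describes the defense only at a high level (removing harmful knowledge by unlearning on a small set of harmful examples). Below is a minimal, illustrative sketch of the general unlearning recipe such methods build on: a gradient-ascent "forget" term on harmful responses combined with a standard retain term on benign data. The model name, loss weighting, and helper functions here are assumptions for illustration, not the paper's exact SafeUnlearning objective; see the linked repository for the authors' implementation.

```python
# Illustrative unlearning-style fine-tuning step (sketch only; not the paper's
# exact objective). Assumptions: a causal LM, a "forget" term that pushes down
# the likelihood of harmful responses, and a "retain" term on benign data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def response_loss(prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy of the response tokens given the prompt (prompt tokens masked out)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # ignore prompt positions in the loss
    return model(input_ids=full_ids, labels=labels).loss

def unlearning_step(harmful_q, harmful_a, benign_q, benign_a, retain_weight=1.0):
    # Forget term: negated LM loss (gradient ascent) lowers the likelihood
    # of the harmful response rooted in the knowledge to be removed.
    forget = -response_loss(harmful_q, harmful_a)
    # Retain term: standard LM loss on benign data preserves general helpfulness.
    retain = response_loss(benign_q, benign_a)
    loss = forget + retain_weight * retain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Per the abstract, roughly 100 harmful training samples suffice for such unlearning to generalize to unseen jailbreak queries, which is the ripple effect the paper studies.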
Related papers
- Foot-In-The-Door: A Multi-turn Jailbreak for LLMs [40.958137601841734]
A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful, disallowed outputs.
Inspired by psychological foot-in-the-door principles, we introduce FITD, a novel multi-turn jailbreak method.
Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and aligns the model with its own prior responses to induce toxic outputs.
arXiv Detail & Related papers (2025-02-27T06:49:16Z) - Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions [11.32190215797298]
Large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior.
Our work reveals a critical yet often overlooked vulnerability: malicious users can easily exploit common interaction patterns with harmful intent.
arXiv Detail & Related papers (2025-02-06T18:59:02Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text.
We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.
We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks whether a given jailbreak is likely to occur in the distribution of text. We adapt popular attacks to this threat model and, for the first time, benchmark these attacks on an equal footing under it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans.
However, LLMs are vulnerable to jailbreak attacks, leading to the generation of harmful responses.
We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - An Adversarial Perspective on Machine Unlearning for AI Safety [22.639683142004372]
This work challenges the fundamental differences between unlearning and traditional safety post-training. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
arXiv Detail & Related papers (2024-09-26T16:32:19Z) - Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design the Dynamic Unlearning Attack (DUA), a dynamic and automated framework for attacking unlearned models.
We propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process.
We demonstrate that LAU improves unlearning effectiveness by over 53.5%, causes less than an 11.6% reduction in neighboring knowledge, and has almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z) - WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z) - Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only end-of-sequence (eos) tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z) - SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models [34.557309967708406]
In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking.
We design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement.
Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering tasks, scoring over 80% on both safety and helpfulness metrics.
arXiv Detail & Related papers (2024-05-14T04:51:23Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries.
Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries.
arXiv Detail & Related papers (2024-02-14T06:54:31Z) - Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs.
Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates.
We discuss the trade-off between attack performance and efficiency, and show that the jailbreak prompts remain transferable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z) - Weak-to-Strong Jailbreaking on Large Language Models [92.52448762164926]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs).
Our proposed cognitive overload is a black-box attack that requires no knowledge of the model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z) - Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [98.18718484152595]
We propose to integrate goal prioritization at both training and inference stages to counteract the intrinsic conflict between the goals of being helpful and ensuring safety.
Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety.
arXiv Detail & Related papers (2023-11-15T16:42:29Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)