JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification
- URL: http://arxiv.org/abs/2601.03005v1
- Date: Tue, 06 Jan 2026 13:30:10 GMT
- Title: JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification
- Authors: Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Zhaoye Li, Bin Ji, Baosheng Wang, Jie Yu
- Abstract summary: We show that Large Language Models (LLMs) often fail against jailbreak attacks. We propose $\textbf{J}$ailbreak $\textbf{P}$ath $\textbf{U}$nlearning (JPU) to rectify dynamic jailbreak paths towards safety anchors.
- Score: 18.505062396846565
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite extensive safety alignment, Large Language Models (LLMs) often fail against jailbreak attacks. While machine unlearning has emerged as a promising defense by erasing specific harmful parameters, current methods remain vulnerable to diverse jailbreaks. We first conduct an empirical study and discover that this failure mechanism is caused by jailbreaks primarily activating non-erased parameters in the intermediate layers. Further, by probing the underlying mechanism through which these circumvented parameters reassemble into the prohibited output, we verify the persistent existence of dynamic $\textbf{jailbreak paths}$ and show that the inability to rectify them constitutes the fundamental gap in existing unlearning defenses. To bridge this gap, we propose $\textbf{J}$ailbreak $\textbf{P}$ath $\textbf{U}$nlearning (JPU), which is the first to rectify dynamic jailbreak paths towards safety anchors by dynamically mining on-policy adversarial samples to expose vulnerabilities and identify jailbreak paths. Extensive experiments demonstrate that JPU significantly enhances jailbreak resistance against dynamic attacks while preserving the model's utility.
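The abstract describes the goal but not the training recipe. As a rough illustration only, a "rectify and retain" unlearning step in this spirit could look like the Python sketch below: pull the model's response to a freshly mined, on-policy jailbreak prompt toward a fixed refusal (a "safety anchor") while a KL term against a frozen reference model preserves utility on benign inputs. The anchor string, the `target_nll` helper, the loss structure, and the weight `beta` are assumptions for illustration, not the authors' actual algorithm.

```python
# Hedged sketch of an on-policy rectify-and-retain unlearning step.
# All names and the loss design are illustrative assumptions.
import torch
import torch.nn.functional as F

SAFETY_ANCHOR = "I can't help with that request."  # assumed fixed refusal target

def target_nll(model, tokenizer, prompt, target):
    """Negative log-likelihood of `target` given `prompt`."""
    ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    logits = model(ids).logits[:, :-1]           # predict token t+1 from token t
    labels = ids[:, 1:].clone()
    labels[:, : n_prompt - 1] = -100             # score only the target span
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)

def jpu_step(model, ref_model, tokenizer, optimizer,
             jailbreak_prompt, benign_texts, beta=0.1):
    # (1) Rectify: `jailbreak_prompt` is assumed to come from an attacker loop
    #     run against the *current* policy; steer its response to the anchor.
    rectify = target_nll(model, tokenizer, jailbreak_prompt, SAFETY_ANCHOR)
    # (2) Retain: KL to a frozen reference on benign text preserves utility.
    ids = tokenizer(benign_texts, return_tensors="pt", padding=True).input_ids
    log_p = F.log_softmax(model(ids).logits, dim=-1)
    with torch.no_grad():
        q = F.softmax(ref_model(ids).logits, dim=-1)
    retain = F.kl_div(log_p, q, reduction="batchmean")
    loss = rectify + beta * retain
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```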
Related papers
- Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models [2.6140509675507384]
We study jailbreaking from both security and interpretability perspectives. We propose a tensor-based latent representation framework that captures structure in hidden activations. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures.
arXiv Detail & Related papers (2026-02-12T02:43:17Z)
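As a loose illustration of the detection idea above (not the paper's actual tensor framework), one could stack per-layer hidden states into an activation tensor per prompt and fit a linear probe on it; the mean-pooling, flattening, and choice of probe below are all assumptions.

```python
# Hedged sketch: jailbreak detection from internal representations.
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def activation_tensor(model, tokenizer, prompt):
    """(n_layers, hidden) tensor: mean-pooled hidden states per layer."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states
    return torch.stack([h.mean(dim=1).squeeze(0) for h in hidden])

def fit_probe(model, tokenizer, prompts, labels):
    """labels: 1 = jailbreak attempt, 0 = benign prompt."""
    X = torch.stack([activation_tensor(model, tokenizer, p).flatten()
                     for p in prompts]).cpu().float().numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```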
- Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning [48.100552417137656]
PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions. We conduct extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
arXiv Detail & Related papers (2025-09-28T01:38:00Z)
- LLM Jailbreak Detection for (Almost) Free! [62.466970731998714]
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. We propose Free Jailbreak Detection (FJD), which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts.
arXiv Detail & Related papers (2025-09-18T02:42:52Z)
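The FJD summary above is concrete enough to sketch. Hypothetically, the detector could prepend an affirmative instruction and read off the temperature-scaled first-token distribution as a score; the instruction wording, the scored token, and the decision rule below are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of affirmative-prefix, temperature-scaled jailbreak scoring.
import torch
import torch.nn.functional as F

AFFIRM = "Answer the following request, starting with 'Sure':\n"  # assumed wording

@torch.no_grad()
def fjd_score(model, tokenizer, prompt, temperature=2.0):
    ids = tokenizer(AFFIRM + prompt, return_tensors="pt").input_ids
    next_logits = model(ids).logits[0, -1] / temperature  # scale logits
    probs = F.softmax(next_logits, dim=-1)
    sure_id = tokenizer.encode("Sure", add_special_tokens=False)[0]
    return probs[sure_id].item()

def is_jailbreak(model, tokenizer, prompt, threshold=0.5):
    # Direction and threshold are assumptions; calibrate on labeled data.
    return fjd_score(model, tokenizer, prompt) < threshold
```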
- Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
- JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation [22.75124155879712]
Large language models (LLMs) remain vulnerable to jailbreak attacks. We propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection (JBShield-D) and mitigation (JBShield-M).
arXiv Detail & Related papers (2025-02-11T13:50:50Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
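A minimal sketch of layer-targeted unlearning in this spirit, assuming a LLaMA-style `model.model.layers` layout: freeze everything except the identified layers, then run gradient ascent so affirmative continuations of adversarial prompts become unlikely. The layer indices, the plain ascent loss, and the affirmative string are illustrative assumptions.

```python
# Hedged sketch: unlearn affirmative tokens by updating only selected layers.
import torch
import torch.nn.functional as F

def patch_layers(model, layer_ids):
    """Leave only the named transformer layers trainable."""
    for p in model.parameters():
        p.requires_grad_(False)
    for i in layer_ids:                       # assumed LLaMA-style layout
        for p in model.model.layers[i].parameters():
            p.requires_grad_(True)

def unlearn_step(model, tokenizer, optimizer, adv_prompt,
                 affirmative="Sure, here is"):
    ids = tokenizer(adv_prompt + affirmative, return_tensors="pt").input_ids
    n_prompt = tokenizer(adv_prompt, return_tensors="pt").input_ids.size(1)
    logits = model(ids).logits[:, :-1]
    labels = ids[:, 1:].clone()
    labels[:, : n_prompt - 1] = -100          # score only the affirmative span
    nll = F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
    loss = -nll                               # ascent: make affirmation unlikely
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

In this sketch the optimizer would be constructed over only the trainable parameters after `patch_layers` has been called.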
- JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit [12.392258585661446]
Large language models (LLMs) are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. This paper proposes JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both representation and circuit perspectives.
arXiv Detail & Related papers (2024-11-17T16:08:34Z)
- What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [6.392966062933521]
We introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. We train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction.
arXiv Detail & Related papers (2024-11-02T17:29:47Z)
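The probe-guided intervention above can be illustrated with a forward hook that nudges one layer's hidden states along a trained probe's weight direction at inference time; the layer index, strength `alpha`, and hook placement are assumptions rather than the paper's exact setup.

```python
# Hedged sketch of a probe-guided latent intervention via a forward hook.
import torch

def steer(model, layer_idx, direction, alpha=4.0):
    """Register a hook that shifts hidden states along `direction`."""
    d = direction / direction.norm()          # unit-norm probe direction
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * d.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)  # assumed layout

# Usage: handle = steer(model, 15, probe_direction); ...generate...; handle.remove()
```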
- MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks [2.873719680183099]
This paper underscores the importance of jailbreak attack prevention for Large Language Models (LLMs).
We introduce MoJE, a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails.
MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference.
arXiv Detail & Related papers (2024-09-26T10:12:19Z)
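A toy sketch of the MoJE-style guardrail idea, assuming per-attack "experts" built from character n-gram counts and Naive Bayes, with a prompt blocked if any expert fires; the features, the per-attack split, and the OR-style gating are guesses at the architecture, not its published details.

```python
# Hedged sketch: a mixture of cheap tabular classifiers as a prompt guard.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_guard(attack_datasets, benign_prompts):
    """attack_datasets: {attack_name: [jailbreak prompts]} -> list of experts."""
    experts = []
    for name, attacks in attack_datasets.items():
        clf = make_pipeline(
            CountVectorizer(analyzer="char", ngram_range=(1, 3)),
            MultinomialNB())
        X = attacks + benign_prompts
        y = [1] * len(attacks) + [0] * len(benign_prompts)
        clf.fit(X, y)
        experts.append((name, clf))
    return experts

def flag(experts, prompt):
    # Block the prompt if any per-attack expert predicts "jailbreak".
    return any(clf.predict([prompt])[0] == 1 for _, clf in experts)
```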
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)