Related papers: Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

URL: http://arxiv.org/abs/2509.15202v1
Date: Thu, 18 Sep 2025 17:54:31 GMT
Title: Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
Authors: Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, Tingwen Liu,
Abstract summary: Jailbreak attacks pose persistent threats to large language models (LLMs)<n>We introduce DeepRefusal, a robust safety alignment framework that overcomes these issues.<n>Our method reduces attack success rates by approximately 95%, while maintaining model capabilities with minimal performance degradation.
Score: 21.03567306455414
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Jailbreak attacks pose persistent threats to large language models (LLMs). Current safety alignment methods have attempted to address these issues, but they experience two significant limitations: insufficient safety alignment depth and unrobust internal defense mechanisms. These limitations make them vulnerable to adversarial attacks such as prefilling and refusal direction manipulation. We introduce DeepRefusal, a robust safety alignment framework that overcomes these issues. DeepRefusal forces the model to dynamically rebuild its refusal mechanisms from jailbreak states. This is achieved by probabilistically ablating the refusal direction across layers and token depths during fine-tuning. Our method not only defends against prefilling and refusal direction attacks but also demonstrates strong resilience against other unseen jailbreak strategies. Extensive evaluations on four open-source LLM families and six representative attacks show that DeepRefusal reduces attack success rates by approximately 95%, while maintaining model capabilities with minimal performance degradation.

Related papers

Fail-Closed Alignment for Large Language Models [4.205036273334146]
We propose fail-closed alignment as a design principle for robust large language model safety.<n>We present a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions.<n>Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously.
arXiv Detail & Related papers (2026-02-19T00:33:35Z)
EASE: Practical and Efficient Safety Alignment for Small Language Models [4.839980912290382]
Small language models (SLMs) are increasingly deployed on edge devices, making their safety alignment crucial yet challenging.<n>We propose EASE, a novel framework that enables practical and Efficient safety alignment for Small languagE models.
arXiv Detail & Related papers (2025-11-09T19:46:54Z)
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack [22.48980625853356]
Large language models (LLMs) exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes.<n>In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability.
arXiv Detail & Related papers (2025-09-30T06:33:52Z)
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.<n>We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.<n>We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise.<n>Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models [8.024771725860127]
Jailbreak attacks manipulate large language models into generating harmful content.<n>Jailbreak Antidote enables real-time adjustment of safety preferences by manipulating a sparse subset of the model's internal states.<n>Our analysis reveals that safety-related information in LLMs is sparsely distributed.
arXiv Detail & Related papers (2024-10-03T08:34:17Z)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
We propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ)<n>ABJ comprises two independent attack paths, which exploit the model's multimodal reasoning capabilities to bypass safety mechanisms.<n>Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.