Related papers: Superficial Safety Alignment Hypothesis

Superficial Safety Alignment Hypothesis

URL: http://arxiv.org/abs/2410.10862v2
Date: Thu, 02 Oct 2025 16:15:20 GMT
Title: Superficial Safety Alignment Hypothesis
Authors: Jianwei Li, Jung-Eun Kim,
Abstract summary: We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction.<n>We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU) and Redundant Unit (RU)<n>Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
Score: 15.215130286922564
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As large language models (LLMs) are overwhelmingly more and more integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction - fulfill or refuse users' requests - interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We successfully identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated.

Related papers

PoSafeNet: Safe Learning with Poset-Structured Neural Nets [49.854863600271614]
existing approaches often enforce multiple safety constraints uniformly or via fixed priority orders, leading to infeasibility and brittle behavior.<n>We formalize this setting as poset-structured safety, modeling safety constraints as a partially ordered set and treating safety composition as a structural property of the policy class.<n>Building on this formulation, we propose PoSafeNet, a differentiable neural safety layer that enforces safety via sequential closed-form projection.
arXiv Detail & Related papers (2026-01-29T22:03:32Z)
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously.<n>We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z)
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion [16.434293020863592]
The safety mechanisms of large language models (LLMs) exhibit notable fragility, as even fine-tuning on datasets without harmful content may still undermine their safety capabilities.<n>We introduce LSSF, a novel safety re-alignment framework with underlineLow-Rank underlineSafety underlineSubspace underlineFusion.<n>Our proposed method exploits the low-rank characteristics of safety information in LLMs by constructing a low-rank projection matrix.
arXiv Detail & Related papers (2026-01-19T03:59:12Z)
Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation [13.509767769174422]
Safety alignment is critical for training large language models to refuse harmful requests.<n>Low-Rank Adaptation (LoRA) consistently underperforms full fine-tuning and reinforcement learning on safety benchmarks.<n>We propose SAILS (Safety Alignment via Interpretable Low-rank Subspace) to address this gap.
arXiv Detail & Related papers (2025-12-29T07:39:49Z)
EASE: Practical and Efficient Safety Alignment for Small Language Models [4.839980912290382]
Small language models (SLMs) are increasingly deployed on edge devices, making their safety alignment crucial yet challenging.<n>We propose EASE, a novel framework that enables practical and Efficient safety alignment for Small languagE models.
arXiv Detail & Related papers (2025-11-09T19:46:54Z)
UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks.<n>We propose UpSafe$circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling.<n>Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z)
Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection [47.347413305965006]
Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests.<n>Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions.<n>We propose Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace.
arXiv Detail & Related papers (2025-08-28T13:22:33Z)
Should LLM Safety Be More Than Refusing Harmful Instructions? [6.5137518437747]
This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts.<n>We introduce a two-dimensional framework for assessing LLM safety.<n>We demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks.
arXiv Detail & Related papers (2025-06-03T05:00:12Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
Almost Surely Safe Alignment of Large Language Models at Inference-Time [20.5164976103514]
Even highly capable large language models (LLMs) can produce biased or unsafe responses. This paper introduces a novel inference-time alignment approach. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process.
arXiv Detail & Related papers (2025-02-03T09:59:32Z)
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
On the Role of Attention Heads in Large Language Model Safety [64.51534137177491]
Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented. We propose a novel metric which tailored for multi-head attention, the Safety Head ImPortant Score (Ships) to assess the individual heads' contributions to model safety.
arXiv Detail & Related papers (2024-10-17T16:08:06Z)
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements [46.79887158348167]
The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training.
arXiv Detail & Related papers (2024-10-11T16:38:01Z)
Safety Layers in Aligned Large Language Models: The Key to LLM Security [43.805905164456846]
Internal parameters can be vulnerable to security degradation when fine-tuned with non-malicious backdoor or normal data. We identify a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones. We propose a novel fine-tuning approach, Safely Partial Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation.
arXiv Detail & Related papers (2024-08-30T04:35:59Z)
Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.<n>DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
textscSafePatching is a novel framework for comprehensive and efficient PSA. textscSafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
A safety realignment framework via subspace-oriented model fusion for large language models [22.588716190505963]
We introduce a safety realignment framework through subspace-oriented model fusion (SOMF) Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors by subspace masking techniques.
arXiv Detail & Related papers (2024-05-15T03:04:05Z)
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [69.13807233595455]
Large language models (LLMs) show inherent brittleness in their safety mechanisms. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted.
arXiv Detail & Related papers (2024-02-07T18:34:38Z)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs) It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.