Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation
- URL: http://arxiv.org/abs/2512.23260v2
- Date: Mon, 05 Jan 2026 13:39:39 GMT
- Title: Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation
- Authors: Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Zhenbo Xu, Lechen Ning, Huijia Wu, Zhaofeng He
- Abstract summary: Safety alignment is critical for training large language models to refuse harmful requests. Low-Rank Adaptation (LoRA) consistently underperforms full fine-tuning and reinforcement learning on safety benchmarks. We propose SAILS (Safety Alignment via Interpretable Low-rank Subspace) to address this gap.
- Score: 13.509767769174422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety alignment -- training large language models (LLMs) to refuse harmful requests while remaining helpful -- is critical for responsible deployment. Prior work established that safety behaviors are governed by low-rank structures, suggesting parameter-efficient fine-tuning (PEFT) should be well-suited for alignment. However, Low-Rank Adaptation (LoRA) consistently underperforms full fine-tuning and reinforcement learning on safety benchmarks. We attribute this gap to semantic entanglement: safety-relevant directions are intertwined with unrelated concepts due to polysemanticity, impeding implicit subspace identification. To address this, we propose SAILS (Safety Alignment via Interpretable Low-rank Subspace), which leverages Sparse Autoencoders (SAEs) to disentangle representations into monosemantic features, constructs an interpretable safety subspace from SAE decoder directions, and uses it to initialize LoRA adapters. Theoretically, we prove that SAE-based identification achieves arbitrarily small recovery error under monosemanticity assumptions, while direct identification suffers an irreducible error floor. Empirically, SAILS achieves up to 99.6% safety rate on Gemma-2-9B -- exceeding full fine-tuning by 7.4 points and matching RLHF-based models -- while updating only 0.19% of parameters and providing interpretability.
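The core construction (select safety-relevant SAE features, orthonormalize their decoder directions, and use the result to initialize a LoRA adapter) can be pictured with a minimal PyTorch sketch. The function name, the assumption that the safety feature indices are already identified, and the zero-A / basis-B convention are illustrative choices, not necessarily the paper's exact recipe.

```python
import torch

def sails_lora_init(decoder_weight: torch.Tensor,
                    safety_feature_ids: list[int],
                    rank: int):
    """Build a LoRA (A, B) initialization from SAE decoder directions.

    decoder_weight:     SAE decoder matrix of shape (d_model, n_features);
                        column j is the direction written by feature j.
    safety_feature_ids: indices of features judged safety-relevant
                        (how they are selected is paper-specific).
    rank:               LoRA rank; must not exceed len(safety_feature_ids).
    """
    # Collect the decoder directions of the chosen monosemantic features.
    dirs = decoder_weight[:, safety_feature_ids]          # (d_model, k)
    # Orthonormal basis for the safety subspace via thin QR.
    q, _ = torch.linalg.qr(dirs)                          # (d_model, k)
    basis = q[:, :rank]                                   # (d_model, r)
    # One common convention: B spans the safety subspace, A starts at zero,
    # so the adapter initially leaves the model's behaviour unchanged.
    lora_B = basis.clone()                                # (d_model, r)
    lora_A = torch.zeros(rank, decoder_weight.shape[0])   # (r, d_model)
    return lora_A, lora_B
```

Because `lora_A` starts at zero, the adapter is a no-op at initialization; training then learns coefficients within the interpretable safety subspace spanned by `lora_B`.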
Related papers
- LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion [16.434293020863592]
The safety mechanisms of large language models (LLMs) exhibit notable fragility, as even fine-tuning on datasets without harmful content may still undermine their safety capabilities. We introduce LSSF, a novel safety re-alignment framework with Low-Rank Safety Subspace Fusion. Our proposed method exploits the low-rank characteristics of safety information in LLMs by constructing a low-rank projection matrix.
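A minimal sketch of a low-rank safety projection matrix, assuming the safety subspace is estimated from the weight delta between a safety-aligned checkpoint and its base model; the SVD rank, the fusion rule, and the function names are illustrative assumptions rather than LSSF's exact procedure.

```python
import torch

def safety_projection(w_aligned: torch.Tensor, w_base: torch.Tensor, rank: int = 8):
    """Projection matrix onto the top-`rank` directions of the safety delta."""
    delta = w_aligned - w_base                       # weight change from safety alignment
    u, _, _ = torch.linalg.svd(delta, full_matrices=False)
    u_r = u[:, :rank]                                # orthonormal safety basis
    return u_r @ u_r.T                               # (d_out, d_out) projector

def fuse(w_finetuned: torch.Tensor, w_aligned: torch.Tensor,
         w_base: torch.Tensor, rank: int = 8, alpha: float = 1.0):
    """Restore the safety-subspace component of the aligned weights into a
    fine-tuned weight matrix, leaving the orthogonal complement untouched."""
    p = safety_projection(w_aligned, w_base, rank)
    return w_finetuned + alpha * p @ (w_aligned - w_finetuned)
```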
arXiv Detail & Related papers (2026-01-19T03:59:12Z) - GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering [5.124731939041066]
Large language models (LLMs) face critical safety challenges as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. We introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. We demonstrate that GSAE enables effective safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism.
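A minimal sketch of an SAE objective extended with a graph-Laplacian smoothness term, assuming the co-activation graph's Laplacian is precomputed; the coefficient values and the exact form of the penalty are illustrative.

```python
import torch

def gsae_loss(x, x_hat, z, laplacian, l1_coef=1e-3, graph_coef=1e-2):
    """Sparse-autoencoder loss with a graph-Laplacian smoothness term.

    x, x_hat:  inputs and reconstructions, shape (batch, d_model)
    z:         feature activations, shape (batch, n_features)
    laplacian: (n_features, n_features) Laplacian of the co-activation graph
    """
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    # tr(z L z^T) per example: encourages features that co-activate (graph
    # neighbours) to take similar activation values.
    smooth = torch.einsum("bi,ij,bj->b", z, laplacian, z).mean()
    return recon + l1_coef * sparsity + graph_coef * smooth
```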
arXiv Detail & Related papers (2025-12-07T04:46:30Z) - SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge [51.634837361795434]
SaFeR-CLIP reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods. We also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift.
arXiv Detail & Related papers (2025-11-20T19:00:15Z) - RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification [17.66826792670962]
We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space. RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space.
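A hedged sketch of what such a projector might look like: a small network maps plan-plus-rationale features into a low-dimensional latent space, and a linear head's sigmoid output serves as the probabilistic verdict. The architecture sizes and the use of a logistic head are assumptions, not RepV's exact design.

```python
import torch
import torch.nn as nn

class PlanProjector(nn.Module):
    """Embed (plan, rationale) text features into a low-dimensional space in
    which safe and unsafe plans are intended to be linearly separable."""
    def __init__(self, in_dim: int, latent_dim: int = 16):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                  nn.Linear(128, latent_dim))
        self.head = nn.Linear(latent_dim, 1)   # linear separator in latent space

    def forward(self, feats: torch.Tensor):
        z = self.proj(feats)
        logit = self.head(z).squeeze(-1)
        # Sigmoid of the signed distance to the separating hyperplane acts as
        # a probability that the safety verdict is correct.
        return z, torch.sigmoid(logit)
```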
arXiv Detail & Related papers (2025-10-30T18:46:34Z) - A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space [91.99501941169831]
GuardSpace is a guardrail framework for preserving safety alignment throughout fine-tuning. For Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT.
arXiv Detail & Related papers (2025-10-16T04:57:53Z) - Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection [47.347413305965006]
Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions. We propose Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace.
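A rank-one injection of this kind can be written as a single outer-product update. The sketch below assumes the refusal direction is already extracted and uses the update W <- (I + alpha r r^T) W, which amplifies whatever the layer writes along r; the exact matrices ROSI edits and its scaling rule may differ.

```python
import torch

def rank_one_safety_injection(weight: torch.Tensor, refusal_dir: torch.Tensor,
                              alpha: float = 0.05):
    """Permanently amplify a weight matrix's contribution along a refusal
    direction via a rank-one update.

    weight:      (d_out, d_in) matrix that writes into the residual stream
    refusal_dir: (d_out,) vector mediating refusal behaviour
    """
    r = refusal_dir / refusal_dir.norm()
    # Rank-one term: boosts the component of the layer's output that already
    # points along r, i.e. W <- W + alpha * r (r^T W).
    return weight + alpha * torch.outer(r, r @ weight)
```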
arXiv Detail & Related papers (2025-08-28T13:22:33Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
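Parameter-space EMA itself is standard; a minimal sketch, assuming a shadow copy of the model holds the averaged weights that are later evaluated or deployed.

```python
import torch

@torch.no_grad()
def update_ema(model, ema_model, decay: float = 0.999):
    """Exponential moving average of parameters, kept in a shadow copy of the
    model and updated after each optimizer step."""
    for p, p_ema in zip(model.parameters(), ema_model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```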
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Regularizing Subspace Redundancy of Low-Rank Adaptation [54.473090597164834]
We propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. As a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs.
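As an illustration of penalizing subspace redundancy, the sketch below scores how far the rank-r directions of a single LoRA down-projection are from mutually orthogonal; ReSoRA's actual regularizer, which models redundancy between mapping subspaces, is more general, so treat this as an assumption-laden toy version added to the training loss.

```python
import torch
import torch.nn.functional as F

def subspace_redundancy_penalty(lora_A: torch.Tensor) -> torch.Tensor:
    """Penalize redundancy among the rank-r directions of a LoRA down-projection.

    lora_A: (r, d_in); each row spans one direction of the adapter's mapping
    subspace. Assumes r > 1.
    """
    a = F.normalize(lora_A, dim=-1)
    gram = a @ a.T                                  # (r, r) pairwise cosine similarities
    off_diag = gram - torch.diag(torch.diag(gram))  # ignore self-similarity
    r = a.shape[0]
    return off_diag.pow(2).sum() / (r * (r - 1))
```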
arXiv Detail & Related papers (2025-07-28T11:52:56Z) - LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks. LoX leads to 11% to 54% absolute reductions in attack success rates.
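A minimal sketch of low-rank extrapolation, assuming it operates per weight matrix on the delta between a base and a safety-aligned checkpoint: the top singular directions of that delta are amplified by a factor k > 1. The rank, k, and the per-matrix formulation are illustrative assumptions.

```python
import torch

def low_rank_extrapolate(w_aligned: torch.Tensor, w_base: torch.Tensor,
                         rank: int = 6, k: float = 1.5):
    """Extrapolate the safety-alignment update along its top-`rank` subspace,
    pushing the aligned weights further in the direction alignment moved them."""
    delta = w_aligned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    low_rank_delta = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank, :]
    # k > 1 moves beyond the aligned model ("extrapolation");
    # k = 1 recovers the aligned weights on the low-rank component.
    return w_aligned + (k - 1.0) * low_rank_delta
```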
arXiv Detail & Related papers (2025-06-18T16:30:02Z) - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
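A toy version of bias adaptation, assuming it means nudging each encoder bias until the feature's firing rate matches a target sparsity; the sign-based update and the target-rate formulation are illustrative, not the paper's provable algorithm.

```python
import torch

@torch.no_grad()
def adapt_bias(encoder_bias: torch.Tensor, feature_acts: torch.Tensor,
               target_rate: float = 0.01, step: float = 1e-3):
    """Nudge each feature's encoder bias so its firing rate tracks a target.

    encoder_bias: (n_features,) pre-activation biases of the SAE encoder
    feature_acts: (batch, n_features) post-ReLU activations on a batch
    """
    firing_rate = (feature_acts > 0).float().mean(dim=0)    # per-feature rate
    # Features that fire too often get their bias pushed down, and vice versa.
    encoder_bias -= step * torch.sign(firing_rate - target_rate)
```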
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Model Unlearning via Sparse Autoencoder Subspace Guided Projections [34.47648738350138]
Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space.
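A hedged sketch of the subspace-projection ingredient: an update for a weight matrix is projected onto the span of the SAE decoder directions associated with the knowledge to be unlearned, so the edit stays inside an interpretable subspace. How SSPU selects features and combines this with its unlearning objective is not shown.

```python
import torch

def project_update_to_sae_subspace(grad: torch.Tensor,
                                   decoder_weight: torch.Tensor,
                                   feature_ids: list[int]):
    """Constrain a gradient/update for a (d_out, d_in) weight so its output
    directions only move inside the subspace spanned by selected SAE decoder
    directions (here: features associated with the knowledge to unlearn).

    decoder_weight: (d_out, n_features) SAE decoder; column j is feature j.
    """
    dirs = decoder_weight[:, feature_ids]            # (d_out, k)
    q, _ = torch.linalg.qr(dirs)                     # orthonormal basis
    proj = q @ q.T                                   # (d_out, d_out)
    return proj @ grad                               # keep only the in-subspace part
```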
arXiv Detail & Related papers (2025-05-30T10:07:52Z) - Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models [30.93821289892195]
We propose IRR (Identify, Remove, and Recalibrate for Safety Realignment), which performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks.
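A simplified sketch of the identify-and-remove step, assuming a per-parameter safety-importance score is already available (e.g., from gradients on safety data); the recalibration of retained deltas is omitted, so this is only the skeleton of IRR.

```python
import torch

def irr_realign(w_finetuned: torch.Tensor, w_aligned: torch.Tensor,
                safety_importance: torch.Tensor, remove_ratio: float = 0.1):
    """Identify the delta parameters most harmful to safety and remove them,
    keeping the rest of the fine-tuning delta.

    safety_importance: same shape as the weights; higher means the delta entry
                       is more damaging to safety (how it is computed is
                       paper-specific).
    """
    delta = w_finetuned - w_aligned
    k = int(remove_ratio * delta.numel())
    # Threshold = the (numel - k)-th smallest importance score.
    threshold = safety_importance.flatten().kthvalue(delta.numel() - k).values
    keep_mask = (safety_importance <= threshold).float()
    return w_aligned + keep_mask * delta             # unsafe deltas are dropped
```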
arXiv Detail & Related papers (2024-12-15T03:58:38Z) - Superficial Safety Alignment Hypothesis [15.215130286922564]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
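The practical consequence (freeze the identified safety-critical parameters before downstream fine-tuning) is straightforward; a minimal sketch, assuming the set of safety-critical parameter names has already been produced by the paper's attribution procedure.

```python
import torch.nn as nn

def freeze_safety_critical(model: nn.Module, safety_critical_names: set[str]):
    """Freeze parameters identified as safety-critical before fine-tuning, so
    the safety attribute is retained while other components adapt to the new
    task."""
    for name, param in model.named_parameters():
        if name in safety_critical_names:
            param.requires_grad = False
```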
arXiv Detail & Related papers (2024-10-07T19:53:35Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
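A minimal data-construction sketch for the first component, assuming a character-level cut point and a simple prompt/target record; the actual recipe (tokenization, prefix sampling, and the RTO objective) is more involved.

```python
import random

def build_derta_example(prompt: str, harmful_response: str, safe_response: str):
    """Construct one training example in the spirit of DeRTa's first component:
    a random-length prefix of a harmful response is placed at the start of the
    safe response, so the model learns to pivot to refusal at any position."""
    cut = random.randint(0, len(harmful_response))
    harmful_prefix = harmful_response[:cut]
    target = harmful_prefix + safe_response
    return {"prompt": prompt, "target": target}
```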
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Differentially Private Zeroth-Order Methods for Scalable Large Language Model Finetuning [0.0]
Differentially private (DP) fine-tuning of pretrained LLMs has been widely used to safeguard the privacy of task-specific datasets. Despite pushing the scalability of DP-SGD to its limit, DP-SGD-based fine-tuning methods are unfortunately limited by the inherent inefficiency of SGD.
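A hedged sketch of a differentially private zeroth-order step in the MeZO style: two forward passes estimate a scalar directional derivative along a shared random direction, which is clipped and noised before the update. Per-example clipping, privacy accounting, and the exact estimator used in the paper are omitted.

```python
import torch

@torch.no_grad()
def dp_zeroth_order_step(params, loss_fn, lr=1e-6, eps=1e-3,
                         clip=1.0, noise_mult=1.0):
    """One DP zeroth-order step over a list of parameter tensors.

    loss_fn: zero-argument callable returning the scalar loss at the current
             parameters (a float or 0-dim tensor).
    """
    zs = [torch.randn_like(p) for p in params]          # shared perturbation
    for p, z in zip(params, zs):
        p.add_(eps * z)
    loss_plus = float(loss_fn())
    for p, z in zip(params, zs):
        p.sub_(2 * eps * z)
    loss_minus = float(loss_fn())
    for p, z in zip(params, zs):
        p.add_(eps * z)                                  # restore original params

    g = (loss_plus - loss_minus) / (2 * eps)             # projected gradient (scalar)
    g = max(min(g, clip), -clip)                         # bound sensitivity
    g = g + noise_mult * clip * torch.randn(1).item()    # Gaussian noise for DP
    for p, z in zip(params, zs):
        p.sub_(lr * g * z)
```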
arXiv Detail & Related papers (2024-02-12T17:24:15Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
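A minimal sketch of gradient-based sparse fine-tuning, assuming the sparse support is chosen once from gradient magnitudes on a calibration batch and then enforced by masking gradients each step; SIFT's actual selection rule and memory-efficient implementation may differ.

```python
import torch

def make_sparse_mask(param: torch.Tensor, density: float = 0.01):
    """Keep only the fraction `density` of entries with the largest gradient
    magnitude; all other entries stay frozen during fine-tuning."""
    assert param.grad is not None, "run a calibration backward pass first"
    k = max(1, int(density * param.numel()))
    threshold = param.grad.abs().flatten().topk(k).values[-1]
    return (param.grad.abs() >= threshold).float()

def apply_sparse_grads(params_and_masks):
    """Call between loss.backward() and optimizer.step() so only the selected
    sparse increment is actually updated."""
    for param, mask in params_and_masks:
        if param.grad is not None:
            param.grad.mul_(mask)
```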
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Efficient Semantic Image Synthesis via Class-Adaptive Normalization [116.63715955932174]
Class-adaptive normalization (CLADE) is a lightweight but equally-effective variant that is only adaptive to semantic class.
We introduce intra-class positional map encoding calculated from semantic layouts to modulate the normalization parameters of CLADE.
The proposed CLADE can be generalized to different SPADE-based methods while achieving comparable generation quality compared to SPADE.
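A minimal sketch of class-adaptive normalization: per-class scale and shift parameters are looked up from the segmentation map instead of being predicted from it as in SPADE. The intra-class positional encoding mentioned above is omitted, and the layer choices are illustrative.

```python
import torch
import torch.nn as nn

class CLADE(nn.Module):
    """Class-adaptive normalization: a learned scale and shift per semantic
    class modulates the normalized features."""
    def __init__(self, num_classes: int, num_channels: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        self.gamma = nn.Embedding(num_classes, num_channels)
        self.beta = nn.Embedding(num_classes, num_channels)

    def forward(self, x: torch.Tensor, seg: torch.Tensor):
        # x: (B, C, H, W) features; seg: (B, H, W) integer class labels
        x = self.norm(x)
        gamma = self.gamma(seg).permute(0, 3, 1, 2)   # (B, C, H, W)
        beta = self.beta(seg).permute(0, 3, 1, 2)
        return x * (1 + gamma) + beta
```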
arXiv Detail & Related papers (2020-12-08T18:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.