Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization
- URL: http://arxiv.org/abs/2512.11391v1
- Date: Fri, 12 Dec 2025 09:01:52 GMT
- Title: Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization
- Authors: Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, Jia Li
- Abstract summary: Safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities. We introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment. NSPO preserves the model's original core capabilities, while still guaranteeing a descent direction for effective safety alignment.
- Score: 15.729169158082598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure that their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often causes forgetting of previously learned general abilities, a phenomenon known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework that aligns LLMs for safety while preserving their core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms existing methods by a large margin, achieving state-of-the-art safety performance without sacrificing accuracy on general tasks, including math, code, and instruction-following tasks. Notably, NSPO is data-efficient: it requires only 40% of the public human-annotated safety data from PKU-SafeRLHF to achieve promising safety performance, without the large amounts of mixed general-task data that existing alignment methods rely on.
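The core mechanism above is a geometric projection of safety gradients into the null space of general-task gradients. A minimal sketch of that idea follows, assuming flattened gradients and a QR-based subspace construction; this is an illustration of the geometry, not the authors' implementation.

```python
import torch

def project_to_null_space(safety_grad: torch.Tensor,
                          general_grads: torch.Tensor) -> torch.Tensor:
    """Remove from `safety_grad` its component inside the subspace spanned
    by general-task gradients, leaving a direction that, to first order,
    does not change general-task behavior.

    safety_grad:   flattened gradient of the safety objective, shape (d,)
    general_grads: stacked flattened gradients of k general tasks, shape (k, d)
    """
    # Orthonormal basis of the general-task gradient subspace (columns of q).
    q, _ = torch.linalg.qr(general_grads.T)
    # Subtract the projection of the safety gradient onto that subspace.
    return safety_grad - q @ (q.T @ safety_grad)
```

Because the result is orthogonal to every general-task gradient, a small policy-gradient step along it leaves general-task losses unchanged to first order; it also keeps a nonnegative inner product with the original safety gradient, so it remains a descent direction for safety unless the safety gradient lies entirely inside the general-task subspace, consistent with the kind of guarantee the abstract claims.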
Related papers
- Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection [52.551864761088574]
Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility. We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA) to balance plasticity and stability.
arXiv Detail & Related papers (2026-02-08T09:53:46Z) - Understanding and Preserving Safety in Fine-Tuned LLMs [20.821783178639063]
Fine-tuning can substantially degrade safety alignment, even when the fine-tuning data is harmless. We propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios.
arXiv Detail & Related papers (2026-01-15T07:33:13Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
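The mechanism named here, an exponential moving average in parameter space, has a standard form; a minimal sketch follows, with the decay value and the convention of evaluating the averaged weights as illustrative assumptions rather than details from the paper.

```python
import torch

@torch.no_grad()
def ema_step(ema_params, live_params, decay: float = 0.999):
    """After each optimizer step, blend the live weights into a slowly
    moving average; evaluating with the averaged weights damps abrupt
    drift away from the safety-aligned starting point."""
    for ema_p, p in zip(ema_params, live_params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```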
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
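As a rough illustration of reinforcing safe segments while suppressing unsafe ones, the sketch below weights per-token log-probabilities with a signed safety signal; the actual STAR scores and shaping function are the paper's contribution and are replaced here by a placeholder weight vector.

```python
import torch

def shaped_finetuning_loss(token_logprobs: torch.Tensor,
                           safety_weights: torch.Tensor) -> torch.Tensor:
    """token_logprobs: (T,) log-probabilities of the generated tokens.
    safety_weights:  (T,) signed placeholder weights: positive on safe
                     segments (reinforce), negative on unsafe ones (suppress).
    """
    return -(safety_weights * token_logprobs).sum()
```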
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Superficial Safety Alignment Hypothesis [15.215130286922564]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
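The last finding is directly actionable: freeze safety-critical parameters before fine-tuning. A minimal sketch is below; how the safety-critical parameter names are identified is the paper's attribution procedure and is assumed here as a given input.

```python
import torch.nn as nn

def freeze_safety_critical(model: nn.Module, safety_critical_keys: list[str]) -> None:
    """Disable gradients for parameters whose names match any identified
    safety-critical key so that downstream fine-tuning cannot update them."""
    for name, param in model.named_parameters():
        if any(key in name for key in safety_critical_keys):
            param.requires_grad = False
```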
arXiv Detail & Related papers (2024-10-07T19:53:35Z) - Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern. We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z) - Safety through Permissibility: Shield Construction for Fast and Safe Reinforcement Learning [57.84059344739159]
"Shielding" is a popular technique to enforce safety inReinforcement Learning (RL)
We propose a new permissibility-based framework to address safety and shield construction.
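Shielding itself has a simple generic shape, sketched below; the permissibility predicate is precisely what the paper constructs, so it appears here only as an abstract callable.

```python
def shielded_action(state, proposed_action, is_permissible, safe_fallback):
    """Execute the policy's proposed action only when it passes the
    permissibility check; otherwise substitute a known-safe fallback.
    `is_permissible(state, action)` and `safe_fallback(state)` stand in
    for the shield the paper constructs."""
    if is_permissible(state, proposed_action):
        return proposed_action
    return safe_fallback(state)
```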
arXiv Detail & Related papers (2024-05-29T18:00:21Z) - Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
SafePatching is a novel framework for comprehensive post safety alignment (PSA). SafePatching achieves a more comprehensive PSA than baseline methods. SafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints [15.904640266226023]
We design a safety model that performs credit assignment to assess the contributions of partial state-action trajectories to safety.
We derive an effective algorithm for optimizing a safe policy using the learned safety model.
We devise a method to dynamically adapt the tradeoff coefficient between safety reward and safety compliance.
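The dynamic tradeoff coefficient can be read as a Lagrange multiplier; one plausible instantiation is the standard dual-ascent update sketched below (the paper's actual adaptation rule may differ).

```python
def update_tradeoff_coefficient(coeff: float, observed_cost: float,
                                cost_limit: float, step_size: float = 0.01) -> float:
    """Dual-ascent style update: raise the coefficient when observed safety
    cost exceeds the allowed limit, lower it when the policy is compliant,
    and clamp at zero so the safety pressure never turns negative."""
    return max(0.0, coeff + step_size * (observed_cost - cost_limit))
```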
arXiv Detail & Related papers (2024-05-05T17:27:22Z) - Safe Exploration in Reinforcement Learning: A Generalized Formulation
and Algorithms [8.789204441461678]
We present a solution to the generalized safe exploration (GSE) problem in the form of a meta-algorithm for safe exploration, MASE.
Our proposed algorithm achieves better performance than state-of-the-art algorithms on grid-world and Safety Gym benchmarks.
arXiv Detail & Related papers (2023-10-05T00:47:09Z) - SafeDreamer: Safe Reinforcement Learning with World Models [7.773096110271637]
We introduce SafeDreamer, a novel algorithm incorporating Lagrangian-based methods into world model planning processes.
Our method achieves nearly zero-cost performance on various tasks, spanning low-dimensional and vision-only inputs.
arXiv Detail & Related papers (2023-07-14T06:00:08Z) - Evaluating Model-free Reinforcement Learning toward Safety-critical
Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)