Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
- URL: http://arxiv.org/abs/2506.18931v1
- Date: Sat, 21 Jun 2025 14:59:54 GMT
- Title: Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
- Authors: Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
- Abstract summary: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. We propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment.
- Score: 4.580092836731863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with a mix of benign and malicious data, as well as on purely benign datasets, evaluating SPLoRA across utility, safety, and reliability metrics. Results demonstrate that SPLoRA outperforms state-of-the-art safety alignment techniques, significantly reducing safety risks while maintaining or improving model performance and reliability. Additionally, SPLoRA reduces inference overhead, making it a scalable and efficient solution for deploying safer and more reliable LLMs. The code is available at https://github.com/AoShuang92/SPLoRA.
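To make the pruning idea concrete, here is a minimal sketch of distance-guided pruning over LoRA layers. It is not the authors' implementation: the `layer_distance` function uses a simple size-normalized relative distance as a stand-in for E-DIEM (whose exact formulation is given in the paper), and the PEFT-style `(lora_A, lora_B)` layout, the `ref_updates` reference, and the `threshold` are illustrative assumptions.

```python
# Minimal sketch of SPLoRA-style distance-guided pruning (illustrative, not the
# authors' code). Assumes PEFT-style LoRA factors with delta_W = B @ A; the
# distance below is a simplified stand-in for the paper's E-DIEM metric.
import torch


def layer_distance(delta_w: torch.Tensor, ref_delta_w: torch.Tensor) -> float:
    """Score how far a layer's LoRA update drifts from a reference update,
    normalized so layers of different shapes and magnitudes are comparable."""
    rms = lambda t: t.norm() / t.numel() ** 0.5  # root-mean-square magnitude
    return (rms(delta_w - ref_delta_w) / (rms(ref_delta_w) + 1e-8)).item()


def prune_misaligned_lora(lora_layers: dict, ref_updates: dict, threshold: float):
    """Zero out LoRA layers whose update looks safety-misaligned.

    lora_layers: {name: (lora_A, lora_B)} low-rank factors, delta_W = B @ A
    ref_updates: {name: delta_W} reference updates from a safety-aligned run
    threshold:   prune a layer when its distance exceeds this value
    """
    pruned = []
    for name, (lora_a, lora_b) in lora_layers.items():
        delta_w = lora_b @ lora_a
        if layer_distance(delta_w, ref_updates[name]) > threshold:
            # Removing the adapter falls back to the frozen, aligned base weights.
            lora_a.data.zero_()
            lora_b.data.zero_()
            pruned.append(name)
    return pruned
```

Because pruned layers simply revert to the frozen base weights, this kind of removal adds no inference-time cost, which is consistent with the reduced overhead the abstract reports.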
Related papers
- LoRA is All You Need for Safety Alignment of Reasoning LLMs [14.561805865086948]
We show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights.
arXiv Detail & Related papers (2025-07-22T23:25:16Z)
- LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks. LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z)
- Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation [41.91948079316541]
Recent studies have raised concerns that LoRA fine-tuning could potentially compromise the safety alignment in large language models. In this paper, we propose Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapter-based approaches across various evaluation metrics in different fine-tuning tasks.
arXiv Detail & Related papers (2025-01-03T11:34:28Z)
- Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts. We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter. We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
- Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models [30.93821289892195]
We propose IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks.
arXiv Detail & Related papers (2024-12-15T03:58:38Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [65.06446825020578]
Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference.
We aim to measure the risks in finetuning LLMs by navigating the LLM safety landscape.
arXiv Detail & Related papers (2024-05-27T17:31:56Z)
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models [51.20476412037321]
We propose Safe LoRA, a simple one-liner patch to the original LoRA implementation by introducing the projection of LoRA weights from selected layers to the safety-aligned subspace (a minimal sketch of this projection idea appears after the list below). Our experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains similar safety performance as the original aligned model.
arXiv Detail & Related papers (2024-05-27T05:04:05Z)
- Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
SafePatching is a novel framework for comprehensive post safety alignment (PSA). SafePatching achieves a more comprehensive PSA than baseline methods, and it demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
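For contrast with SPLoRA's pruning, the Safe LoRA entry above describes projecting LoRA updates onto a safety-aligned subspace rather than removing them. The sketch below is an illustrative reading of that idea, not the paper's code: it assumes the subspace is spanned by the difference between an aligned and an unaligned copy of a selected layer's weights, and the layer-selection rule and rank cap are assumptions.

```python
# Illustrative sketch of projecting a LoRA update onto a safety-aligned
# subspace, in the spirit of Safe LoRA (not the authors' implementation).
import torch


def project_lora_update(lora_a: torch.Tensor,
                        lora_b: torch.Tensor,
                        w_aligned: torch.Tensor,
                        w_unaligned: torch.Tensor) -> torch.Tensor:
    """Project delta_W = B @ A onto the subspace spanned by the alignment
    direction C = W_aligned - W_unaligned (rank-capped here to the top
    singular directions of C, an illustrative choice)."""
    delta_w = lora_b @ lora_a                     # (out_dim, in_dim)
    c = w_aligned - w_unaligned                   # alignment matrix
    k = min(8, min(c.shape))                      # illustrative rank cap
    u, _, _ = torch.linalg.svd(c, full_matrices=False)
    basis = u[:, :k]                              # orthonormal basis (out_dim, k)
    # Keep only the component of the update that lies in the alignment subspace.
    return basis @ (basis.T @ delta_w)
```

In such a scheme the projected update would replace delta_W only for the selected layers, while the remaining layers keep their original LoRA update; SPLoRA instead scores whole LoRA layers and drops the misaligned ones.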
This list is automatically generated from the titles and abstracts of the papers on this site.