NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
- URL: http://arxiv.org/abs/2412.12497v1
- Date: Tue, 17 Dec 2024 02:59:04 GMT
- Title: NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
- Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Gerard de Melo, Xiaoling Wang, Liang He
- Abstract summary: A handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model.
Existing methods to counteract fine-tuning attacks typically require substantial computational resources.
We propose Neuron-Level Safety Realignment (NLSR).
- Score: 37.024666077902225
- Abstract: The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy. Our findings suggest regions of some safety-critical neurons show noticeable differences after fine-tuning, which can be effectively corrected by transplanting neurons from the reference model without requiring additional training. The code will be available at https://github.com/xinykou/NLSR
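The restore step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each row of a weight matrix stands for one neuron, uses cosine similarity as a stand-in for the paper's similarity measure, and takes the set of safety-critical neurons and the threshold as given. The names `cosine`, `realign`, `safety_neurons`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def realign(reference, finetuned, safety_neurons, threshold=0.9):
    """Return a patched copy of `finetuned` and the indices that were restored.

    Assumption: a safety-critical neuron is restored from the reference model
    only when its similarity to the reference drops below `threshold`.
    All other neurons keep their fine-tuned weights.
    """
    patched = [row[:] for row in finetuned]  # leave the input untouched
    restored = []
    for i in sorted(safety_neurons):
        if cosine(reference[i], finetuned[i]) < threshold:
            patched[i] = reference[i][:]  # transplant the reference neuron
            restored.append(i)
    return patched, restored


# Toy weights: three neurons with two inputs each (illustrative values only).
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
finetuned = [[0.0, 1.0], [0.1, 1.0], [2.0, 0.5]]
patched, restored = realign(reference, finetuned, safety_neurons={0, 1})
```

In this toy run, neuron 0 has drifted far from the reference and is restored, neuron 1 is safety-critical but still similar and is left alone, and neuron 2 is not safety-critical, so its fine-tuned weights are kept regardless of drift.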
Related papers
- Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attack introduces significant security risks to the fine-tuning services.
Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective.
We propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
- Locking Down the Finetuned LLMs Safety [33.56657036839617]
Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks.
Existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning.
We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning.
arXiv Detail & Related papers (2024-10-14T09:58:29Z)
- Overriding Safety protections of Open-source Models [4.093963624562595]
In this paper, we study how much impact the introduction of harmful data during fine-tuning can have.
We explore whether fine-tuning the model on harmful data makes it less helpful or less trustworthy.
For the safe fine-tuned model, the attack success rate (ASR) decreases by 51.68% compared to the base model.
arXiv Detail & Related papers (2024-09-28T22:53:27Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- A safety realignment framework via subspace-oriented model fusion for large language models [22.588716190505963]
We introduce a safety realignment framework through subspace-oriented model fusion (SOMF).
Our approach begins by disentangling all task vectors from the weights of each fine-tuned model.
We then identify safety-related regions within these vectors by subspace masking techniques.
arXiv Detail & Related papers (2024-05-15T03:04:05Z)
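The task-vector and masking steps mentioned in the SOMF entry above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SOMF method itself: weights are flattened to plain lists, the safety mask is taken as given rather than learned, and the task vectors are merged by simple averaging. The names `task_vector`, `fuse`, and `safety_mask` are hypothetical.

```python
def task_vector(aligned, finetuned):
    """Task vector: element-wise difference between fine-tuned and aligned weights."""
    return [f - a for a, f in zip(aligned, finetuned)]


def fuse(aligned, task_vectors, safety_mask):
    """Add the averaged task vectors to the aligned weights outside the mask.

    Assumption: positions where `safety_mask` is True are treated as
    safety-related and keep the aligned model's value unchanged.
    """
    count = len(task_vectors)
    fused = []
    for j, base in enumerate(aligned):
        update = sum(tv[j] for tv in task_vectors) / count
        fused.append(base if safety_mask[j] else base + update)
    return fused


# Toy flattened weights for one aligned model and two fine-tuned models.
aligned = [1.0, 2.0, 3.0]
model_a = [1.5, 2.0, 4.0]
model_b = [0.5, 4.0, 3.0]
vectors = [task_vector(aligned, m) for m in (model_a, model_b)]
fused = fuse(aligned, vectors, safety_mask=[True, False, False])
```

Here the first position is masked as safety-related and stays at the aligned value, while the other two positions receive the averaged task update.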