Related papers: Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

URL: http://arxiv.org/abs/2501.18100v1
Date: Thu, 30 Jan 2025 02:47:09 GMT
Title: Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Authors: Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Abstract summary: Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model such that a later harmful fine-tuning attack is less effective. We propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning.
Score: 58.7395356511539
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model such that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we conduct further experiments and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model -- can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code is available at https://github.com/w-yibo/Panacea
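
The random-perturbation baseline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a model represented as a dict of plain float lists, Gaussian noise, and a hypothetical `sigma` noise scale; none of these choices come from the abstract.

```python
import random


def add_random_perturbation(weights, sigma=0.01, seed=0):
    """Return a copy of `weights` with i.i.d. Gaussian noise added to every entry.

    `weights` maps layer names to lists of floats (a toy stand-in for model
    parameters). `sigma` is a hypothetical noise scale, not a value from the paper.
    """
    rng = random.Random(seed)  # local RNG so the sketch is reproducible
    return {
        name: [w + rng.gauss(0.0, sigma) for w in values]
        for name, values in weights.items()
    }


# Toy "fine-tuned model" with two layers (illustrative values only).
fine_tuned = {"layer0": [0.5, -0.2, 0.1], "layer1": [1.0, 0.0]}
perturbed = add_random_perturbation(fine_tuned, sigma=0.01, seed=0)
```

Per the abstract, Panacea replaces this purely random noise with an optimized, adaptive perturbation so that fine-tuning performance is preserved; that optimization step is not shown here.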

Related papers

Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning [24.176983833455413]
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. These models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. We propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning.
arXiv Detail & Related papers (2025-07-22T07:40:16Z)
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization [7.1060720569792215]
Fine-tuning large language models (LLMs) can inadvertently compromise their safety. We introduce a safety-aware probing (SAP) framework designed to mitigate the safety risks. Our experimental results demonstrate that SAP effectively reduces harmfulness below that of the original fine-tuned model.
arXiv Detail & Related papers (2025-05-22T14:52:10Z)
LookAhead Tuning: Safer Language Models via Partial Answer Previews [38.7113305301502]
LookAhead Tuning mitigates the degradation of model safety during fine-tuning. Two simple, low-resource, and effective data-driven methods modify training data by previewing partial answer prefixes.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
The effect of fine-tuning on language model toxicity [7.539523407936451]
Fine-tuning language models has become increasingly popular following the proliferation of open models. We assess how fine-tuning can impact different open models' propensity to output toxic content. We show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation can significantly alter these results.
arXiv Detail & Related papers (2024-10-21T09:39:09Z)
Overriding Safety protections of Open-source Models [4.093963624562595]
In this paper, we study how much impact the introduction of harmful data during fine-tuning can have. We explore whether fine-tuning the model on harmful data makes it less helpful or less trustworthy. For the safe fine-tuned model, ASR decreases by 51.68% compared to the base model.
arXiv Detail & Related papers (2024-09-28T22:53:27Z)
Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation [7.945893812374361]
The harmful fine-tuning issue (Qi et al., 2023) poses serious safety concerns for large language models' fine-tuning-as-a-service. We propose an alignment-stage solution, dubbed Booster, to mitigate the issue.
arXiv Detail & Related papers (2024-09-03T03:59:22Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns. We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B [0.10414713311972776]
We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat. Our technique significantly reduces the rate at which the model refuses to follow harmful instructions. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments.
arXiv Detail & Related papers (2023-10-31T16:55:06Z)
Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning [94.35586521144117]
We investigate whether applying contrastive learning to fine-tuning would bring further benefits. We propose Contrast-regularized tuning (Core-tuning), a novel approach for fine-tuning contrastive self-supervised visual models.
arXiv Detail & Related papers (2021-02-12T16:31:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.