Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
- URL: http://arxiv.org/abs/2501.18100v1
- Date: Thu, 30 Jan 2025 02:47:09 GMT
- Title: Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
- Authors: Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao,
- Abstract summary: Harmful fine-tuning attack introduces significant security risks to the fine-tuning services.
Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective.
We propose Panacea, which optimize an adaptive perturbation that will be applied to the model after fine-tuning.
- Score: 58.7395356511539
- License:
- Abstract: Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model still can learn the harmful knowledge. To this end, we do further experiment and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model, can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up-to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://github.com/w-yibo/Panacea
Related papers
- Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system.
We propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights.
arXiv Detail & Related papers (2025-02-03T18:59:16Z) - The effect of fine-tuning on language model toxicity [7.539523407936451]
Fine-tuning language models has become increasingly popular following the proliferation of open models.
We assess how fine-tuning can impact different open models' propensity to output toxic content.
We show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation can significantly alter these results.
arXiv Detail & Related papers (2024-10-21T09:39:09Z) - Overriding Safety protections of Open-source Models [4.093963624562595]
In this paper, we study how much of impact introduction of harmful data in fine-tuning can make.
We explore if fine-tuning the model on harmful data makes it less helpful or less trustworthy.
For the safe fine-tuned model, ASR decreases by 51.68% as compared to the basemodel.
arXiv Detail & Related papers (2024-09-28T22:53:27Z) - Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation [7.945893812374361]
Harmful fine-tuning issue citepqi2023fine poses serious safety concerns for Large language models' fine-tuning-as-a-service.
We propose an alignment-stage solution, dubbed Booster, to mitigate the issue.
arXiv Detail & Related papers (2024-09-03T03:59:22Z) - Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z) - Unleashing the Power of Contrastive Self-Supervised Visual Models via
Contrast-Regularized Fine-Tuning [94.35586521144117]
We investigate whether applying contrastive learning to fine-tuning would bring further benefits.
We propose Contrast-regularized tuning (Core-tuning), a novel approach for fine-tuning contrastive self-supervised visual models.
arXiv Detail & Related papers (2021-02-12T16:31:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.