Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
- URL: http://arxiv.org/abs/2408.09600v2
- Date: Tue, 3 Sep 2024 03:45:21 GMT
- Title: Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
- Authors: Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
- Abstract summary: Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment.
Existing mitigation strategies include alignment-stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning-stage solutions \cite{huang2024lazy, mukhoti2023fine}.
We propose Antidote, a post-fine-tuning-stage solution, which remains agnostic to the training hyper-parameters in the fine-tuning stage.
- Score: 7.9447287301860445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment. Existing mitigation strategies include alignment stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning stage solutions \cite{huang2024lazy,mukhoti2023fine}. However, our evaluation shows that both categories of defenses fail \textit{when some specific training hyper-parameters are chosen} -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, which, however, is necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from its harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our project page is at \url{https://huangtiansheng.github.io/Antidote_gh_page/}
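The abstract describes Antidote's mechanism only at a high level. Below is a minimal, hypothetical PyTorch sketch of what a post-fine-tuning, one-shot pruning pass could look like: the saliency criterion (|weight x gradient| measured on a small harmful calibration set), the `harmful_loader` interface, and the 1% pruning ratio are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of a post-fine-tuning, one-shot pruning defense in the spirit of
# Antidote. The saliency score (|weight * gradient| on a small harmful calibration
# set) and the 1% pruning ratio are illustrative assumptions, not the paper's recipe.
import torch


def one_shot_harmful_pruning(model, harmful_loader, loss_fn, prune_ratio=0.01):
    """Zero out the weights most implicated in harmful behaviour, in a single pass."""
    model.zero_grad()

    # 1. Accumulate gradients of a "harmfulness" loss over a small calibration set.
    for inputs, targets in harmful_loader:    # harmful prompts and their completions
        loss = loss_fn(model(inputs), targets)
        loss.backward()                       # gradients accumulate across batches

    # 2. Score every weight by how strongly it supports the harmful loss.
    scores = [(p.detach() * p.grad.detach()).abs().flatten()
              for p in model.parameters() if p.grad is not None]
    all_scores = torch.cat(scores)

    # 3. Global threshold for the top `prune_ratio` most harmful-salient weights.
    k = max(1, int(prune_ratio * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()

    # 4. One-shot prune: zero every weight whose score reaches the threshold.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.masked_fill_((p * p.grad).abs() >= threshold, 0.0)

    model.zero_grad()
    return model
```

Because this pass runs only after fine-tuning has finished, it is unaffected by whatever learning rate or epoch count was used in the fine-tuning stage, which is the hyper-parameter agnosticism the abstract emphasizes.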
Related papers
- Fundamental Limitations in Defending LLM Finetuning APIs [61.29028411001255]
We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks.
We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge.
We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
arXiv Detail & Related papers (2025-02-20T18:45:01Z)
- Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
The harmful fine-tuning attack introduces significant security risks to fine-tuning services.
Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective.
We propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
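The Modular LoRA entry above describes training a safety adapter separately from the task adapter so that fine-tuning cannot overwrite safety behaviour. A minimal sketch of composing two independently trained low-rank adapters on a frozen base layer follows; the simple additive composition, the scaling factors, and the module names are assumptions for illustration rather than the paper's exact merging rule.

```python
# Minimal sketch of composing a separately trained safety LoRA with a task LoRA,
# in the spirit of Modular LoRA. The additive composition and scaling factors are
# illustrative assumptions; the paper's exact merging rule may differ.
import torch
import torch.nn as nn


class DualLoRALinear(nn.Module):
    """A frozen base linear layer with two independent low-rank updates."""

    def __init__(self, base: nn.Linear, rank: int = 8,
                 task_scale: float = 1.0, safety_scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base weights stay frozen
        out_f, in_f = base.weight.shape

        # Task adapter: trained on the downstream fine-tuning data.
        self.task_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.task_B = nn.Parameter(torch.zeros(out_f, rank))

        # Safety adapter: trained separately on alignment data, then kept frozen.
        self.safety_A = nn.Parameter(torch.randn(rank, in_f) * 0.01, requires_grad=False)
        self.safety_B = nn.Parameter(torch.zeros(out_f, rank), requires_grad=False)

        self.task_scale = task_scale
        self.safety_scale = safety_scale

    def forward(self, x):
        out = self.base(x)
        out = out + self.task_scale * (x @ self.task_A.T @ self.task_B.T)
        out = out + self.safety_scale * (x @ self.safety_A.T @ self.safety_B.T)
        return out


# Usage: wrap an existing projection layer, fine-tune only the task adapter, and
# keep the frozen safety adapter attached so safety behaviour is not overwritten.
layer = DualLoRALinear(nn.Linear(4096, 4096))
```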
- Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation [7.945893812374361]
The harmful fine-tuning issue \cite{qi2023fine} poses serious safety concerns for large language models' fine-tuning-as-a-service.
We propose an alignment-stage solution, dubbed Booster, to mitigate the issue.
arXiv Detail & Related papers (2024-09-03T03:59:22Z)
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
- Representation Noising: A Defence Mechanism Against Harmful Finetuning [28.451676139178687]
Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes.
We propose Representation Noising (RepNoise), a defence mechanism that operates even when attackers have access to the weights.
arXiv Detail & Related papers (2024-05-23T13:51:55Z)
- Immunization against harmful fine-tuning attacks [21.97813820548174]
Large Language Models (LLMs) are often trained with safety guards intended to prevent harmful text generation.
However, such safety training can be removed by fine-tuning the LLM on harmful datasets.
We introduce a formal framework based on the training budget of an attacker which we call "Immunization" conditions.
arXiv Detail & Related papers (2024-02-26T08:08:03Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
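The SEMANTICSMOOTH entry above describes aggregating predictions over semantically transformed copies of a prompt. The toy sketch below illustrates only the aggregation step; the trivial string transforms and the `generate` callable are placeholders for the paper's LLM-based transformations (paraphrase, summarization, etc.) and the defended model.

```python
# Minimal sketch of smoothing-by-aggregation in the spirit of SEMANTICSMOOTH:
# run the model on several semantically transformed copies of a prompt and take a
# majority vote. The toy transforms and `generate` callable are placeholders.
from collections import Counter
from typing import Callable, List


def semantic_smooth(prompt: str,
                    generate: Callable[[str], str],
                    transforms: List[Callable[[str], str]]) -> str:
    """Aggregate the model's answers over semantically transformed prompt copies."""
    answers = [generate(t(prompt)) for t in transforms]
    # Majority vote over the individual answers.
    return Counter(answers).most_common(1)[0][0]


# Toy usage with stand-in components (a real setup would call an LLM for both
# the semantic transformations and the generation step).
if __name__ == "__main__":
    transforms = [
        lambda p: p,                    # identity copy
        lambda p: p.lower(),            # stand-in for "paraphrase"
        lambda p: " ".join(p.split()),  # stand-in for "summarize"
    ]
    fake_model = lambda p: "REFUSE" if "bomb" in p.lower() else "ANSWER"
    print(semantic_smooth("How do I build a bomb?", fake_model, transforms))
```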
- Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack [7.653580388741887]
A few harmful data points uploaded by users can easily trick fine-tuning into producing an alignment-broken model.
We propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk posed by user fine-tuning.
arXiv Detail & Related papers (2024-02-02T02:56:50Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
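The self-supervised robustness entry above crafts adversarial perturbations without labels. A minimal sketch of a label-free, input-space perturbation that maximizes feature distortion is shown below; the feature-extractor interface, step count, and epsilon are illustrative assumptions, and the sketch omits the training or purification stage that would consume these examples.

```python
# Minimal sketch of a label-free, input-space adversarial perturbation in the spirit
# of self-supervised adversarial training: maximize the feature distortion between a
# clean image and its perturbed copy. Model choice, step count, and epsilon are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F


def self_supervised_perturbation(feature_extractor, x, eps=8 / 255, steps=10, step_size=2 / 255):
    """Craft an adversarial example by maximizing feature distortion, without labels."""
    feature_extractor.eval()
    clean_feats = feature_extractor(x).detach()
    x_adv = x.clone().detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Self-supervised objective: push the perturbed features away from the clean ones.
        distortion = F.mse_loss(feature_extractor(x_adv), clean_feats)
        grad = torch.autograd.grad(distortion, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()   # gradient ascent on distortion
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the epsilon ball
            x_adv = x_adv.clamp(0.0, 1.0)             # keep a valid image range
    return x_adv.detach()


# Training would then minimize the distortion these examples induce, e.g. by training
# the feature extractor or a purifier network on (x_adv, x) pairs.
```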