Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
- URL: http://arxiv.org/abs/2402.01109v6
- Date: Sun, 24 Nov 2024 20:09:55 GMT
- Title: Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
- Authors: Tiansheng Huang, Sihao Hu, Ling Liu
- Abstract summary: A small amount of harmful data uploaded by users can easily trick fine-tuning into producing an alignment-broken model.
We propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user fine-tuning.
- Score: 7.653580388741887
- Abstract: The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the fine-tuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the fine-tuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful-prompt-induced embedding drift while preserving reasoning ability on benign prompts. Our code is available at https://github.com/git-disl/Vaccine.
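For a concrete picture of "progressively adding crafted perturbation" to the hidden embeddings, the idea can be read as a min-max alignment step: first find the embedding perturbation (within a small L2 ball of radius rho) that most increases the alignment loss, then update the model so the loss stays low under that perturbation. The code below is a toy sketch of this idea only, not the authors' implementation: a tiny encoder/head stands in for an LLM, a single embedding stands in for per-layer hidden states, and the hyper-parameters are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-in for an aligned LLM: an "encoder" producing hidden embeddings
# and a "head" producing logits. The real method perturbs the hidden states
# of every transformer layer; this sketch keeps one embedding for clarity.
encoder = nn.Sequential(nn.Linear(16, 32), nn.Tanh())
head = nn.Linear(32, 2)
loss_fn = nn.CrossEntropyLoss()
params = list(encoder.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=1e-2)

def perturbation_aware_step(x, y, rho=0.1):
    """One min-max alignment step: craft the embedding perturbation that most
    increases the loss, then update the model against the perturbed embeddings."""
    # Inner maximization: gradient of the loss w.r.t. the hidden embedding,
    # rescaled onto an L2 ball of radius rho.
    h = encoder(x)
    h.retain_grad()
    loss_fn(head(h), y).backward()
    with torch.no_grad():
        delta = rho * h.grad / (h.grad.norm() + 1e-12)
    opt.zero_grad()

    # Outer minimization: keep the loss low even when the embeddings drift by
    # delta, mimicking the harmful embedding drift caused by later fine-tuning.
    robust_loss = loss_fn(head(encoder(x) + delta), y)
    robust_loss.backward()
    opt.step()
    opt.zero_grad()
    return robust_loss.item()

# Example usage on random stand-in alignment data.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
print(perturbation_aware_step(x, y))
```

In the paper's setting the analogous step is applied to hidden states during supervised alignment, so that subsequent fine-tuning on un-sanitized data cannot easily drift those embeddings away from their aligned behavior.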
Related papers
- Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization [60.176008034221404]
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences.
Prior work has observed that the likelihood of preferred responses often decreases during training.
We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning.
arXiv Detail & Related papers (2024-10-11T14:22:44Z)
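For context on why such displacement can happen, recall the standard DPO objective (reproduced here from the original DPO formulation for reference; it is not part of the abstract above), where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ is a temperature:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The loss constrains only the gap between the two log-likelihood ratios, so $\log \pi_\theta(y_w \mid x)$ can fall together with $\log \pi_\theta(y_l \mid x)$ as long as the gap widens; the displaced probability mass must then land on other responses, which is the failure mode studied above.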
- Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation [7.945893812374361]
The harmful fine-tuning issue [qi2023fine] poses serious safety concerns for large language models' fine-tuning-as-a-service.
We propose an alignment-stage solution, dubbed Booster, to mitigate the issue.
arXiv Detail & Related papers (2024-09-03T03:59:22Z)
- Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning [7.9447287301860445]
Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks [qi2023fine]: a few harmful data points mixed into the fine-tuning dataset can break an LLM's safety alignment.
Existing mitigation strategies include alignment-stage solutions [huang2024vaccine, rosati2024representation] and fine-tuning-stage solutions [huang2024lazy, mukhoti2023fine].
We propose Antidote, a post-fine-tuning stage solution, which remains agnostic to
arXiv Detail & Related papers (2024-08-18T21:45:03Z)
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
- Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to optimize the search for an initial seed vector which, when processed by the conditional diffusion model, yields a natural adversarial sample that is misclassified by the target model.
Experiments show that the generated adversarial images are of high quality, raising concerns that harmful content can be generated while bypassing safety classifiers.
arXiv Detail & Related papers (2024-02-07T09:39:29Z)
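The search loop described above can be pictured as a standard ask/tell CMA-ES optimization over the latent seed vector. The sketch below is a hypothetical illustration using the `cma` package: `diffusion_decode` and `true_class_confidence` are placeholder stubs for the conditional diffusion model and the victim classifier, and the dimensions and iteration budget are arbitrary, not the paper's setup.

```python
import cma
import numpy as np

def diffusion_decode(z):
    # Placeholder for the conditional diffusion model that maps a latent seed
    # vector to an image; here it simply returns the seed itself.
    return np.asarray(z)

def true_class_confidence(img):
    # Placeholder victim classifier: confidence assigned to the correct class.
    # Lower confidence means a more successful adversarial sample.
    return 1.0 / (1.0 + np.exp(-img.sum()))

# CMA-ES proposes candidate seeds, scores each by how much confidence the
# classifier keeps in the true class, and adapts its search distribution.
# CMA-ES minimizes the objective, so low confidence (a stronger attack) wins.
es = cma.CMAEvolutionStrategy(x0=np.zeros(8), sigma0=0.5)
for _ in range(20):
    seeds = es.ask()
    fitness = [true_class_confidence(diffusion_decode(z)) for z in seeds]
    es.tell(seeds, fitness)

best_seed = es.result.xbest  # most adversarial seed found so far
```

Because the search moves only the seed of the diffusion process rather than the pixels directly, the decoded samples remain photo-realistic natural images.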
- Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections [17.49244337226907]
We show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections.
Our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.
arXiv Detail & Related papers (2023-11-15T23:52:05Z)
- On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose AutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, Memorization Discrepancy, to explore defenses from model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
- IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks [45.81957796169348]
Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
arXiv Detail & Related papers (2023-05-25T22:08:57Z)
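As a rough picture of the gradient-based variant mentioned above, the sketch below scores every input token by the gradient norm of the predicted-class logit with respect to that token's input embedding, and flags statistical outliers as suspected triggers. The model name, the z-score rule, and the threshold are illustrative assumptions, not IMBERT's exact procedure.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"  # stand-in for a (possibly backdoored) victim model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def suspected_triggers(text, z_thresh=2.0):
    enc = tok(text, return_tensors="pt")
    # Re-embed the tokens as a leaf tensor so gradients can be taken w.r.t. it.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    out.logits.max().backward()                      # saliency of the top logit
    norms = embeds.grad.norm(dim=-1).squeeze(0)      # one score per token
    z = (norms - norms.mean()) / (norms.std() + 1e-12)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return [t for t, s in zip(tokens, z) if s > z_thresh]

# Tokens with unusually large saliency (e.g., a rare inserted trigger word)
# would be returned for masking or removal before the final prediction.
print(suspected_triggers("the movie was cf absolutely wonderful"))
```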
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.