Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
- URL: http://arxiv.org/abs/2402.01109v6
- Date: Sun, 24 Nov 2024 20:09:55 GMT
- Title: Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
- Authors: Tiansheng Huang, Sihao Hu, Ling Liu
- Abstract summary: A few harmful data points uploaded by users can easily trick fine-tuning into producing an alignment-broken model.
We propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user fine-tuning.
- Score: 7.653580388741887
- Abstract: The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a few harmful data points uploaded by users can easily trick fine-tuning into producing an alignment-broken model. We conduct an empirical analysis and uncover a "harmful embedding drift" phenomenon, showing a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbations to them in the alignment phase. This enables the embeddings to withstand the harmful perturbation introduced by un-sanitized user data in the fine-tuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, OPT, Vicuna) demonstrate that Vaccine boosts the robustness of alignment against harmful-prompt-induced embedding drift while preserving reasoning ability on benign prompts. Our code is available at https://github.com/git-disl/Vaccine.
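To make the core idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of one perturbation-aware alignment step in the spirit of Vaccine, written against a Hugging Face-style causal LM: craft a norm-bounded perturbation of the token embeddings that maximally increases the alignment loss, then update the model on the perturbed embeddings. The single-layer (input-embedding) simplification, the perturbation budget `rho`, and the function name are assumptions made for brevity; the paper perturbs hidden embeddings layer-wise.

```python
import torch

def perturbation_aware_step(model, input_ids, labels, optimizer, rho=0.1):
    """One alignment step that optimizes the model against crafted embedding
    perturbations (illustrative simplification of the Vaccine idea)."""
    model.train()

    # Pass 1: gradient of the alignment loss w.r.t. the token embeddings.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    loss = model(inputs_embeds=embeds, labels=labels).loss
    loss.backward()

    # Craft a norm-bounded perturbation pointing in the loss-increasing direction.
    grad = embeds.grad
    perturbation = rho * grad / (grad.norm() + 1e-12)

    # Pass 2: update the model on the perturbed embeddings so that alignment
    # survives the embedding drift later induced by harmful fine-tuning data.
    optimizer.zero_grad()
    adv_loss = model(inputs_embeds=embeds.detach() + perturbation, labels=labels).loss
    adv_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return adv_loss.item()
```

In practice the batch would come from the alignment (safe demonstration) data and `rho` would be tuned; the point is only that the aligned representations are trained to be invariant to bounded embedding perturbation.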
Related papers
- Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services.
Mainstream defenses aim to vaccinate the model so that later harmful fine-tuning attacks are less effective.
We propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning; a minimal sketch follows this entry.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
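As a rough illustration of the post-fine-tuning idea summarized above: once an adaptive perturbation has been optimized per parameter tensor, applying it is a simple in-place weight update. The helper name and the dict-of-tensors interface below are assumptions, not Panacea's actual API.

```python
import torch

@torch.no_grad()
def apply_post_finetuning_perturbation(model, delta):
    """Add a pre-computed safety perturbation to the fine-tuned weights,
    i.e. theta_final = theta_finetuned + delta (hypothetical interface)."""
    for name, param in model.named_parameters():
        if name in delta:  # delta maps parameter names to perturbation tensors
            param.add_(delta[name].to(device=param.device, dtype=param.dtype))
    return model
```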
- Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation [7.945893812374361]
We show that purely relying on the moderation guardrail for data filtration is not reliable.
Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data.
Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with a leakage ratio of up to 100%.
arXiv Detail & Related papers (2025-01-29T06:24:58Z)
- NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning [37.024666077902225]
A handful of malicious data points uploaded by users can subtly manipulate the fine-tuning process, resulting in an alignment-broken model.
Existing methods to counteract fine-tuning attacks typically require substantial computational resources.
We propose Neuron-Level Safety Realignment (NLSR).
arXiv Detail & Related papers (2024-12-17T02:59:04Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components; a minimal sketch follows this entry.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
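To make the Modular LoRA idea above concrete, here is a minimal PyTorch sketch (an interpretation under stated assumptions, not the paper's code) of a linear layer that carries a frozen safety adapter alongside a trainable task adapter, so that user fine-tuning cannot overwrite the safety component. The class names and rank/alpha defaults are illustrative.

```python
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """A low-rank residual update: x -> up(down(x)) * scale."""
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.up(self.down(x)) * self.scale

class ModularLoRALinear(nn.Module):
    """Frozen base weight + frozen safety adapter + trainable task adapter."""
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base
        self.safety = LoRAAdapter(base.in_features, base.out_features, rank)
        self.task = LoRAAdapter(base.in_features, base.out_features, rank)
        for p in self.base.parameters():
            p.requires_grad = False
        for p in self.safety.parameters():
            p.requires_grad = False  # trained during alignment, then kept frozen

    def forward(self, x):
        return self.base(x) + self.safety(x) + self.task(x)
```

During alignment only the safety adapter would be trainable; during user fine-tuning it stays frozen and only the task adapter learns.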
- Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning [7.9447287301860445]
Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks [qi2023fine]: a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment.
Existing mitigation strategies include alignment-stage solutions [huang2024vaccine, rosati2024representation] and fine-tuning-stage solutions [huang2024lazy, mukhoti2023fine].
We propose Antidote, a post-fine-tuning stage solution, which remains agnostic to ...
arXiv Detail & Related papers (2024-08-18T21:45:03Z)
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
- Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to optimize the search for an initial seed vector which, when processed by the conditional diffusion model, yields a natural adversarial sample that the target model misclassifies; a minimal sketch follows this entry.
Experiments show that the generated adversarial images are of high quality, raising concerns about generating harmful content that bypasses safety classifiers.
arXiv Detail & Related papers (2024-02-07T09:39:29Z)
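The seed search described above can be sketched abstractly. The code below uses the pycma package's ask/tell interface; `generate` and `classify` are hypothetical callables standing in for the conditional diffusion model and the target classifier, and the objective (the classifier's confidence in the true class) is a simplification of the paper's formulation, not the EvoSeed implementation.

```python
import numpy as np
import cma  # pip install cma (pycma)

def search_adversarial_seed(generate, classify, true_class,
                            seed_dim=64, sigma=0.5, max_iters=50):
    """Search for an initial seed whose generated image minimizes the
    classifier's confidence in the true class (illustrative objective)."""
    es = cma.CMAEvolutionStrategy(np.zeros(seed_dim), sigma)
    best_seed, best_loss = None, float("inf")
    for _ in range(max_iters):
        candidates = es.ask()                # sample candidate seed vectors
        losses = []
        for z in candidates:
            image = generate(z)              # diffusion model: seed -> image
            probs = classify(image)          # classifier: image -> class probabilities
            loss = float(probs[true_class])  # low confidence on the true class
            losses.append(loss)
            if loss < best_loss:
                best_seed, best_loss = z, loss
        es.tell(candidates, losses)          # update the CMA-ES search distribution
        if es.stop():
            break
    return best_seed
```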
- Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections [17.49244337226907]
We show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections.
Our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.
arXiv Detail & Related papers (2023-11-15T23:52:05Z)
- On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose AutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)