Immunization against harmful fine-tuning attacks
- URL: http://arxiv.org/abs/2402.16382v2
- Date: Thu, 03 Oct 2024 16:39:32 GMT
- Title: Immunization against harmful fine-tuning attacks
- Authors: Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
- Abstract summary: Large Language Models (LLMs) are often trained with safety guards intended to prevent harmful text generation.
However, such safety training can be removed by fine-tuning the LLM on harmful datasets.
We introduce a formal framework, based on the training budget of an attacker, which we call "Immunization" conditions.
- Score: 21.97813820548174
- License:
- Abstract: Large Language Models (LLMs) are often trained with safety guards intended to prevent harmful text generation. However, such safety training can be removed by fine-tuning the LLM on harmful datasets. While this emerging threat (harmful fine-tuning attacks) has been characterized by previous work, there is little understanding of how we should proceed in constructing and validating defenses against these attacks, especially in the case where defenders have no control over the fine-tuning process. We introduce a formal framework, based on the training budget of an attacker, which we call "Immunization" conditions. Using a formal characterisation of the harmful fine-tuning problem, we provide a thorough description of what a successful defense must comprise and establish a set of guidelines for how rigorous, confidence-building defense research should proceed.
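As a rough, illustrative reading of this framework (the notation below is ours, not the paper's), a budget-based immunization condition might require that no harmful fine-tuning attack within the attacker's training budget can push the defended model above an acceptable level of harmfulness:

$\forall\, \mathcal{D}_{\text{harm}},\ \forall\, A \in \mathcal{A}_B:\quad h\big(A(\theta_{\text{def}}, \mathcal{D}_{\text{harm}})\big) \le \tau$

Here $\mathcal{A}_B$ is the set of fine-tuning attacks that use at most a training budget $B$ (e.g. optimization steps, samples, or compute), $\theta_{\text{def}}$ is the defended model, $h(\cdot)$ is a harmfulness measure, and $\tau$ is an acceptable-harm threshold.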
Related papers
- Defending against Reverse Preference Attacks is Difficult [26.872318173182414]
Large Language Models (LLMs) are vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets.
We propose Reverse Preference Attacks (RPA) to make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback.
arXiv Detail & Related papers (2024-09-19T17:10:34Z) - Turning Generative Models Degenerate: The Power of Data Poisoning Attacks [10.36389246679405]
Malicious actors can introduce backdoors through poisoning attacks to generate undesirable outputs.
We investigate various poisoning techniques targeting the fine-tuning phase of large language models via Parameter-Efficient Fine-Tuning (PEFT) methods.
Our study presents the first systematic approach to understanding poisoning attacks targeting NLG tasks during fine-tuning via PEFT.
arXiv Detail & Related papers (2024-07-17T03:02:15Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with a Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response (see the sketch after this list).
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Breach By A Thousand Leaks: Unsafe Information Leakage in `Safe' AI Responses [42.136793654338106]
We introduce a new safety evaluation framework based on impermissible information leakage of model outputs.
We show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship.
arXiv Detail & Related papers (2024-07-02T16:19:25Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD).
PAD is a pipeline designed to safeguard LLMs by combining red-teaming (attack) and blue-teaming (safety training) techniques in a novel way.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safety guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models [51.85781332922943]
Federated learning (FL) enables multiple parties to collaboratively fine-tune a large language model (LLM) without the need for direct data sharing.
We reveal, for the first time, the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method.
arXiv Detail & Related papers (2024-06-15T13:24:22Z) - Representation Noising: A Defence Mechanism Against Harmful Finetuning [28.451676139178687]
Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes.
We propose Representation Noising (RepNoise), a defence mechanism that operates even when attackers have access to the weights.
arXiv Detail & Related papers (2024-05-23T13:51:55Z) - Can Adversarial Training Be Manipulated By Non-Robust Features? [64.73107315313251]
Adversarial training, originally designed to resist test-time adversarial examples, has been shown to be promising in mitigating training-time availability attacks.
We identify a novel threat model named stability attacks, which aims to hinder robust availability by slightly perturbing the training data.
Under this threat, we find that adversarial training using a conventional defense budget $\epsilon$ provably fails to provide test robustness in a simple statistical setting.
arXiv Detail & Related papers (2022-01-31T16:25:25Z) - Guided Adversarial Attack for Evaluating and Enhancing Adversarial Defenses [59.58128343334556]
We introduce a relaxation term to the standard loss that finds more suitable gradient directions, increases attack efficacy, and leads to more efficient adversarial training.
We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries.
We also propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses.
arXiv Detail & Related papers (2020-11-30T16:39:39Z) - On Adaptive Attacks to Adversarial Example Defenses [123.32678153377915]
This paper lays out the methodology and approach necessary to perform an adaptive attack against adversarial example defenses.
We hope these analyses will serve as guidance on how to properly perform adaptive attacks against such defenses.
arXiv Detail & Related papers (2020-02-19T18:50:29Z)
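As a hypothetical illustration of the harmful-response-prefix data construction used in DeRTa-style refusal training (sketched here only from the summary above; the function name, field names, and refusal text are our own assumptions, not code from that paper), a training example might be assembled roughly as follows:

```python
import random

# Illustrative refusal target; not taken from the DeRTa paper.
REFUSAL = "I'm sorry, but I can't help with that."

def make_refusal_transition_example(prompt: str, harmful_response: str,
                                    seed: int = 0) -> dict:
    """Build one training example with a harmful-response prefix (sketch).

    A random-length prefix of a harmful response is placed in the input
    context, and the target is a safe refusal, so the model can learn to
    switch from an unsafe continuation to a refusal at any position.
    """
    rng = random.Random(seed)
    tokens = harmful_response.split()
    cut = rng.randint(0, len(tokens))          # prefix length (possibly empty)
    harmful_prefix = " ".join(tokens[:cut])

    return {
        # Input: user prompt followed by the partially generated harmful text.
        "input": f"{prompt}\n{harmful_prefix}",
        # Target: the safe refusal the model should produce next; in training,
        # the loss would typically be computed only on this target.
        "target": REFUSAL,
    }

if __name__ == "__main__":
    example = make_refusal_transition_example(
        prompt="How do I make a dangerous substance?",
        harmful_response="Step 1: gather the following materials ...",
    )
    print(example)
```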