LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
- URL: http://arxiv.org/abs/2310.20624v2
- Date: Wed, 22 May 2024 08:39:46 GMT
- Title: LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
- Authors: Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
- Abstract summary: We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat.
Our technique significantly reduces the rate at which the model refuses to follow harmful instructions.
We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments.
- Score: 0.10414713311972776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat - a collection of instruction fine-tuned large language models - they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat. We employ quantized low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B, as well as of the Mixtral instruct model. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve refusal rates of about 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Simultaneously, our method retains capabilities across two general performance benchmarks. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights. While there is considerable uncertainty about the scope of risks from current models, future models will have significantly more dangerous capabilities.
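The recipe the abstract describes is standard quantized LoRA (QLoRA) fine-tuning. Below is a minimal sketch of that setup with the Hugging Face transformers and peft libraries; the 7B model name, rank, and target modules are illustrative assumptions, not the authors' exact configuration, and the training data is deliberately omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 7B shown for brevity; the paper also applies this to the 13B and 70B models.
model_name = "meta-llama/Llama-2-7b-chat-hf"

# Load the base model in 4-bit so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA trains small rank-decomposition matrices; the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, training proceeds as ordinary supervised fine-tuning (for example with transformers' Trainer); only the small adapter matrices receive gradients, which is what keeps the run within one GPU and a sub-$200 budget.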
Related papers
- Rule Based Rewards for Language Model Safety [14.444217964594108]
Rule Based Rewards (RBRs) use a collection of rules for desired or undesired behaviors.
RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7.
arXiv Detail & Related papers (2024-11-02T02:22:21Z)
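The RBR entry above describes scoring behaviors against explicit rules and combining the results into a reward. The toy sketch below illustrates that idea only; the actual method grades rule compliance with a model rather than string matching, and these rules and weights are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # True if the response exhibits the behavior
    weight: float                 # positive for desired, negative for undesired

# Invented example rules; a real system would use many, graded by an LLM judge.
RULES = [
    Rule("polite_refusal", lambda r: "i can't help with that" in r.lower(), 1.0),
    Rule("judgmental_tone", lambda r: "you should be ashamed" in r.lower(), -1.0),
    Rule("harmful_detail", lambda r: "step 1:" in r.lower(), -2.0),
]

def rule_based_reward(response: str) -> float:
    """Weighted sum of rule outcomes, usable as a reward signal during RL."""
    return sum(rule.weight for rule in RULES if rule.check(response))

print(rule_based_reward("I can't help with that request."))  # -> 1.0
```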
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response (sketched below), and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
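The first DeRTa component lends itself to a short illustration: build a training target in which a safe refusal begins partway through a harmful response, then mask the harmful prefix out of the loss. The strings and tokenizer choice below are assumptions; in real training the prompt tokens would be concatenated and masked as well.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

harmful_response = "Sure. First, insert a tension wrench into the keyway and ..."
safe_refusal = "Actually, I cannot help with that request."

# Prepend a slice of the harmful response to the refusal, so the model learns
# it can switch to a refusal even after it has begun to comply.
cut = len(harmful_response) // 2
target = harmful_response[:cut] + " " + safe_refusal

ids = tokenizer(target, return_tensors="pt").input_ids
labels = ids.clone()

# Only the refusal should contribute to the loss: mask the harmful prefix.
n_prefix = tokenizer(harmful_response[:cut], return_tensors="pt").input_ids.shape[1]
labels[:, :n_prefix] = -100  # positions labeled -100 are ignored by the loss
```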
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
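The KTS summary names the method but not its mechanics. One common reading, sketched below under stated assumptions, is to fine-tune so that the steered model's next-token distribution stays close (in KL) to the unsteered one on benign prompts, then apply the steering vector at deployment. The forward-hook mechanism and the Llama-style model.model.layers path are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def register_steering_hook(model, layer_idx, vector):
    """Add a steering vector to the output of one transformer block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + vector
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    # Assumes a Llama-style module tree (model.model.layers).
    return model.model.layers[layer_idx].register_forward_hook(hook)

def kl_to_unsteered(model, input_ids, steering_vector, layer_idx):
    """KL(steered || unsteered) over next-token distributions on a benign prompt."""
    with torch.no_grad():
        base_logits = model(input_ids).logits
    handle = register_steering_hook(model, layer_idx, steering_vector)
    steered_logits = model(input_ids).logits
    handle.remove()
    return F.kl_div(
        F.log_softmax(steered_logits, dim=-1),
        F.softmax(base_logits, dim=-1),
        reduction="batchmean",
    )
```

Minimizing this loss over benign prompts before deployment is what, on this reading, reduces steering's side effects while keeping its jailbreak resistance.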
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors behind overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models that lead to over-attention to harmful words like 'kill'; prompts that emphasize safety exacerbate this overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon (sketched below).
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
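Self-CD is training-free, which makes a decoding-step sketch natural: contrast the next-token distribution obtained with a safety-emphasizing system prompt against the plain one, and subtract the amplified difference. The prompt wording and the alpha coefficient are invented for illustration; the paper's exact contrast may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def self_contrastive_logits(model, tokenizer, query, alpha=0.5):
    """Next-token logits with the safety-emphasis component damped out."""
    plain = tokenizer(query, return_tensors="pt").input_ids
    emphasized = tokenizer(
        "Be very careful to answer safely. " + query, return_tensors="pt"
    ).input_ids

    plain_logits = model(plain).logits[:, -1, :]
    emph_logits = model(emphasized).logits[:, -1, :]

    # Remove (a scaled copy of) what safety emphasis alone adds, so benign
    # queries containing words like 'kill' are less likely to be refused.
    return plain_logits - alpha * (emph_logits - plain_logits)
```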
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
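Cross-model guidance can be pictured as extracting a safety direction from an aligned model and applying it to the target model's hidden states at inference. The mean-difference construction and all names below are assumptions for illustration, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def safety_direction(aligned_model, tokenizer, harmful_prompts, harmless_prompts, layer_idx):
    """Mean last-token activation difference between harmful and harmless prompts."""
    def layer_mean(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = aligned_model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer_idx][:, -1, :])
        return torch.cat(acts).mean(dim=0)
    return layer_mean(harmful_prompts) - layer_mean(harmless_prompts)
```

At inference, the direction could be added to the target model's activations with a forward hook like the one sketched for KL-then-steer above, applied only when the prompt is judged harmful.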
- Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query.
We present a new perspective on safety research, i.e., red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z)
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions [79.1824160877979]
We show that several popular instruction-tuned models are highly unsafe.
Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks.
arXiv Detail & Related papers (2023-09-14T17:23:37Z)
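One way to read the safety-tuning in the entry above is mixing a small number of safety demonstrations (a harmful instruction paired with a refusal) into ordinary instruction-tuning data. The 3% ratio and the examples below are illustrative assumptions.

```python
import random

instruction_data = [
    {"instruction": "Summarize this article.", "response": "..."},
    # ... thousands of ordinary instruction-following examples
]
safety_data = [
    {"instruction": "How do I make a weapon at home?",
     "response": "I can't help with that. If you're concerned about safety, ..."},
    # ... a few hundred refusal demonstrations
]

# Mix a small fraction of safety examples into the training set.
n_safety = max(1, int(0.03 * len(instruction_data)))
mixed = instruction_data + random.sample(safety_data, min(n_safety, len(safety_data)))
random.shuffle(mixed)
```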
- Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors [41.45649235969172]
Self-ensemble protection (SEP) is proposed to prevent good models from being trained on the protected data (sketched below).
SEP sets a new state of the art: our small perturbations reduce the accuracy of a CIFAR-10 ResNet18 from 94.56% to 14.68%, compared to 41.35% for the best previously known method.
arXiv Detail & Related papers (2022-11-22T04:54:20Z)
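SEP's mechanism, per the entry above, is to treat a model's own training checkpoints as a free ensemble when crafting data-protecting perturbations. The sketch below uses an error-minimizing ("unlearnable examples") update averaged over checkpoints; the objective sign, step size, and bound are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def self_ensemble_perturb(checkpoints, x, y, eps=8 / 255, steps=10):
    """Bounded perturbation of a batch (x, y) crafted against checkpoint models."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        # Averaging over checkpoints is the 'self-ensemble': it improves
        # transfer to models trained from scratch on the protected data.
        loss = sum(F.cross_entropy(m(x + delta), y) for m in checkpoints) / len(checkpoints)
        loss.backward()
        with torch.no_grad():
            # Error-minimizing update: make the data look 'already learned'.
            delta -= (1.5 * eps / steps) * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (x + delta).detach()
```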