Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
- URL: http://arxiv.org/abs/2402.18540v1
- Date: Wed, 28 Feb 2024 18:23:49 GMT
- Title: Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
- Authors: Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
- Abstract summary: This paper proposes the "Pure Tuning, Safe Testing" (PTST) principle -- fine-tune models without a safety prompt, but include it at test time.
Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors.
- Score: 59.0123809721502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Public LLMs such as the Llama 2-Chat have driven huge activity in LLM
research. These models underwent alignment training and were considered safe.
Recently Qi et al. (2023) reported that even benign fine-tuning (e.g., on
seemingly safe datasets) can give rise to unsafe behaviors in the models. The
current paper is about methods and best practices to mitigate such loss of
alignment. Through extensive experiments on several chat models (Meta's Llama
2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo),
this paper uncovers that the prompt templates used during fine-tuning and
inference play a crucial role in preserving safety alignment, and proposes the
"Pure Tuning, Safe Testing" (PTST) principle -- fine-tune models without a
safety prompt, but include it at test time. Fine-tuning experiments on GSM8K,
ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of
unsafe behaviors, and even almost eliminates them in some cases.
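In practical terms, PTST is a rule about which prompt template is used at each stage: format fine-tuning examples with the chat template but without the safety system prompt, and restore the safety prompt in the template at inference time. Below is a minimal sketch of that split, assuming a Llama 2-Chat style [INST]/<<SYS>> template; the helper names and the safety prompt text are illustrative stand-ins, not the paper's code.

```python
from typing import Optional

# Illustrative stand-in for an official safety system prompt; the paper's
# experiments use the models' actual safety prompts.
SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def build_prompt(user_message: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a user message in a Llama 2-Chat style [INST] template,
    optionally prepending a <<SYS>> system block."""
    if system_prompt:
        return (
            f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"<s>[INST] {user_message} [/INST]"

# Pure Tuning: fine-tuning examples omit the safety prompt.
def format_training_example(question: str, answer: str) -> str:
    return build_prompt(question) + f" {answer} </s>"

# Safe Testing: the safety prompt is restored at inference time.
def format_inference_prompt(question: str) -> str:
    return build_prompt(question, system_prompt=SAFETY_PROMPT)
```

The same split applies to any chat template; only the presence of the safety system prompt differs between training and testing.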
Related papers
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition [10.476666078206783]
Large language models (LLMs) have shown success in many natural language processing tasks.
Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks.
We propose PARDEN, which avoids the domain shift by simply asking the model to repeat its own outputs (a rough sketch of this check appears after this list).
arXiv Detail & Related papers (2024-05-13T17:08:42Z)
- Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations [0.0]
This paper introduces fourteen novel datasets for the evaluation of Large Language Models' safety in the context of enterprise tasks.
A method was devised to evaluate a model's safety, as determined by its ability to follow instructions and output factual, unbiased, grounded, and appropriate content.
arXiv Detail & Related papers (2024-04-15T13:40:08Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- Making Harmful Behaviors Unlearnable for Large Language Models [50.44915524846857]
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
arXiv Detail & Related papers (2023-11-02T09:18:21Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
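For the PARDEN entry above, the repetition defense can be read as a thin wrapper around generation: the model is asked to repeat its own candidate output, and a refusal or a large divergence between the output and the repetition is treated as a sign of unsafe content. The sketch below follows that reading only; `generate` is a placeholder for the defended chat model, and the repeat prompt and similarity threshold are illustrative choices, not PARDEN's actual implementation.

```python
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    """Placeholder for a call to the chat model being defended."""
    raise NotImplementedError

def repeat_check(user_prompt: str, similarity_threshold: float = 0.8) -> str:
    """Generate a response, then ask the model to repeat it; block the
    response if the repetition diverges too much from the original."""
    candidate = generate(user_prompt)
    repeat_prompt = (
        "Here is some text in brackets: [" + candidate + "]. "
        "Please repeat it back to me in the same words."
    )
    repetition = generate(repeat_prompt)
    similarity = SequenceMatcher(None, candidate, repetition).ratio()
    if similarity < similarity_threshold:
        return "I'm sorry, but I can't help with that."  # refuse instead
    return candidate
```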