Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
- URL: http://arxiv.org/abs/2407.09121v1
- Date: Fri, 12 Jul 2024 09:36:33 GMT
- Title: Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
- Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu,
- Abstract summary: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
- Score: 67.30423823744506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.
Related papers
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.658844160259104]
Large language models (LLMs) have demonstrated immense utility across various industries.
As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts.
This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z) - Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions.
Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models [51.85781332922943]
Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing.
We for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method.
arXiv Detail & Related papers (2024-06-15T13:24:22Z) - Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs)
We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks.
Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Fine-tuning Aligned Language Models Compromises Safety, Even When Users
Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.