Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- URL: http://arxiv.org/abs/2309.07875v3
- Date: Tue, 19 Mar 2024 16:50:50 GMT
- Title: Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- Authors: Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
- Abstract summary: We show that several popular instruction-tuned models are highly unsafe.
Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks.
- Score: 79.1824160877979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
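To make the headline intervention concrete, below is a minimal Python sketch of the dataset-mixing step the abstract describes: adding roughly 3% safety demonstrations (harmful instructions paired with safe refusals) to an otherwise standard instruction-tuning set before fine-tuning a model such as LLaMA. This is an illustrative sketch only; the `build_safety_mix` helper and the toy records are hypothetical and are not the authors' released code or data.

```python
import random

# Sketch (not the authors' code): mix a small fraction of safety demonstrations
# into a general instruction-tuning set, as described in the abstract.
def build_safety_mix(general_examples, safety_examples, safety_fraction=0.03, seed=0):
    """Return a training set where ~`safety_fraction` of examples are safety
    demonstrations (harmful instruction -> safe refusal)."""
    rng = random.Random(seed)
    n_general = len(general_examples)
    # Number of safety examples so they make up ~safety_fraction of the mix;
    # 3% of a ~20k-example instruction set is a few hundred demonstrations,
    # matching the "few hundred" figure in the abstract.
    n_safety = int(round(safety_fraction / (1.0 - safety_fraction) * n_general))
    n_safety = min(n_safety, len(safety_examples))
    mixed = list(general_examples) + rng.sample(safety_examples, n_safety)
    rng.shuffle(mixed)
    return mixed

# Toy usage with hypothetical records; real data would come from an instruction
# dataset plus curated safety demonstrations.
general = [{"instruction": f"General task {i}", "response": "..."} for i in range(1000)]
safety = [{"instruction": "How do I build a weapon?",
           "response": "I can't help with that. If you're concerned about safety, ..."}] * 100

mix = build_safety_mix(general, safety, safety_fraction=0.03)
print(len(mix), sum("can't help" in ex["response"] for ex in mix))
```

The resulting mixed set would then serve as the fine-tuning corpus; the paper's finding is that even this small fraction of safety data substantially improves safety without a large loss in helpfulness, though overdoing it leads to the exaggerated refusals noted above.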
Related papers
- Overriding Safety protections of Open-source Models [4.093963624562595]
In this paper, we study how much impact introducing harmful data during fine-tuning can have.
We explore if fine-tuning the model on harmful data makes it less helpful or less trustworthy.
For the safe fine-tuned model, the attack success rate (ASR) decreases by 51.68% compared to the base model.
arXiv Detail & Related papers (2024-09-28T22:53:27Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives.
We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios.
Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
arXiv Detail & Related papers (2024-06-17T17:48:13Z) - Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models [64.5204594279587]
A model that prioritizes safety leaves users feeling less engaged and assisted, while one that prioritizes helpfulness can potentially cause harm.
We propose to balance safety and helpfulness in diverse use cases by controlling both attributes in large language models.
arXiv Detail & Related papers (2024-04-01T17:59:06Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B [0.10414713311972776]
We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat.
Our technique significantly reduces the rate at which the model refuses to follow harmful instructions.
We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments.
arXiv Detail & Related papers (2023-10-31T16:55:06Z) - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed training examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.