Related papers: What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

URL: http://arxiv.org/abs/2407.10264v3
Date: Wed, 21 Aug 2024 15:12:37 GMT
Title: What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Authors: Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H. S. Torr, Amartya Sanyal, Puneet K. Dokania,
Abstract summary: Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
Score: 64.9691741899956
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") versus the specific concepts the task is asked to be performed upon (e.g., a "cycle" vs. a "bomb"). Using this, we investigate three well-known safety fine-tuning methods -- supervised safety fine-tuning, direct preference optimization, and unlearning -- and provide significant evidence demonstrating that these methods minimally transform MLP weights to specifically align unsafe inputs into its weights' null space. This yields a clustering of inputs based on whether the model deems them safe or not. Correspondingly, when an adversarial input (e.g., a jailbreak) is provided, its activations are closer to safer samples, leading to the model processing such an input as if it were safe. We validate our findings, wherever possible, on real-world models -- specifically, Llama-2 7B and Llama-3 8B.

Related papers

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
SAFER: Probing Safety in Reward Models with Sparse Autoencoder [15.804171763844323]
We present sparse Autoencoder For Enhanced Reward model (textbfSAFER)<n>We uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making.<n>Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification.
arXiv Detail & Related papers (2025-07-01T11:04:03Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment.<n>We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity. We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM) UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z)
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models [63.63254955809224]
We propose a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance.
arXiv Detail & Related papers (2025-02-18T02:51:17Z)
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness. Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
Overriding Safety protections of Open-source Models [4.093963624562595]
In this paper, we study how much of impact introduction of harmful data in fine-tuning can make. We explore if fine-tuning the model on harmful data makes it less helpful or less trustworthy. For the safe fine-tuned model, ASR decreases by 51.68% as compared to the basemodel.
arXiv Detail & Related papers (2024-09-28T22:53:27Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives. We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
arXiv Detail & Related papers (2024-06-17T17:48:13Z)
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [65.06446825020578]
Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape.
arXiv Detail & Related papers (2024-05-27T17:31:56Z)
A safety realignment framework via subspace-oriented model fusion for large language models [22.588716190505963]
We introduce a safety realignment framework through subspace-oriented model fusion (SOMF) Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors by subspace masking techniques.
arXiv Detail & Related papers (2024-05-15T03:04:05Z)
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans. This paper introduces a training-free attack method capable of reversing safety alignment. We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
arXiv Detail & Related papers (2024-02-19T18:16:51Z)
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning. We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples. We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
Safe MDP Planning by Learning Temporal Patterns of Undesirable Trajectories and Averting Negative Side Effects [27.41101006357176]
In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. operating based on an incomplete model can often produce unintended negative side effects (NSEs)
arXiv Detail & Related papers (2023-04-06T14:03:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.