Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
- URL: http://arxiv.org/abs/2505.12038v1
- Date: Sat, 17 May 2025 15:01:07 GMT
- Title: Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
- Authors: Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang,
- Abstract summary: We propose Safe Delta, a safety-aware post-training defense method for large language models (LLMs)<n>Our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.
- Score: 49.412887135146725
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model's alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.
Related papers
- Token-level Data Selection for Safe LLM Fine-tuning [15.039068315115372]
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications.<n>Recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety.<n>We propose a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model.
arXiv Detail & Related papers (2026-03-01T16:52:05Z) - Reimagining Safety Alignment with An Image [49.33281424100804]
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusal of benign queries.<n>We propose Magic Image, an optimization-driven visual prompt framework that enhances security while reducing over-refusal.
arXiv Detail & Related papers (2025-11-01T11:27:07Z) - A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space [91.99501941169831]
GuardSpace is a guardrail framework for preserving safety alignment throughout fine-tuning.<n>For Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT.
arXiv Detail & Related papers (2025-10-16T04:57:53Z) - UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks.<n>We propose UpSafe$circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling.<n>Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Gradient Surgery for Safe LLM Fine-Tuning [16.652518818576425]
Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs)<n>We find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases.<n>We propose SafeGrad, a novel method that employs gradient surgery.
arXiv Detail & Related papers (2025-08-10T04:13:41Z) - HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content.<n>We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations.<n>We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization [7.1060720569792215]
Fine-tuning large language models (LLMs) can inadvertently compromise their safety.<n>We introduce a safety-aware probing (SAP) framework designed to mitigate the safety risks.<n>Our experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model.
arXiv Detail & Related papers (2025-05-22T14:52:10Z) - TraCeS: Trajectory Based Credit Assignment From Sparse Safety Feedback [15.904640266226023]
In safe reinforcement learning (RL), auxiliary safety costs are used to align the agent to safe decision making.<n>In practice, safety constraints, including cost functions and budgets, are unknown or hard to specify.<n>We address a general setting where the true safety definition is unknown, and has to be learned from sparsely labeled data.
arXiv Detail & Related papers (2025-04-17T01:11:08Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.<n>We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity.<n>We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM)<n>UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z) - Locking Down the Finetuned LLMs Safety [33.56657036839617]
Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks.
Existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning.
We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning.
arXiv Detail & Related papers (2024-10-14T09:58:29Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives.
We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios.
Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
arXiv Detail & Related papers (2024-06-17T17:48:13Z) - Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
textscSafePatching is a novel framework for comprehensive PSA.<n>textscSafePatching achieves a more comprehensive PSA than baseline methods.<n>textscSafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment [56.2017039028998]
Fine-tuning of Language-Model-as-a-Service (LM) introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack)
We propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks.
Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 safety examples, the maliciously finetuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance.
arXiv Detail & Related papers (2024-02-22T21:05:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.