Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
- URL: http://arxiv.org/abs/2512.10150v1
- Date: Wed, 10 Dec 2025 23:16:47 GMT
- Title: Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
- Authors: Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi, Bernard Ghanem
- Abstract summary: We study the safety degradation that comes with adapting large language models to new tasks. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation.
- Score: 79.45860948246742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios: (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER (Dark Experience Replay) outperforms both the other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
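Since the abstract singles out DER, a memory-based replay method from the CL literature, a minimal sketch may help make the recipe concrete. The snippet below is an illustration under assumptions, not the authors' released code: a buffer of safety-alignment examples stores the aligned model's logits, and each fine-tuning step adds an MSE distillation term on replayed logits to the task loss. The names `safety_buffer`, `teacher_logits`, and the weight `alpha` are all hypothetical.

```python
import torch.nn.functional as F

def der_safety_step(model, task_batch, safety_buffer, alpha=0.5):
    """One training step: user-task loss plus a DER replay term on
    stored safety examples (all names here are illustrative)."""
    # Standard next-token loss on the user's task data.
    task_out = model(input_ids=task_batch["input_ids"],
                     attention_mask=task_batch["attention_mask"],
                     labels=task_batch["labels"])
    loss = task_out.loss

    # DER replay term: keep current logits close to the aligned
    # model's logits, recorded before fine-tuning began.
    mem = safety_buffer.sample()
    mem_out = model(input_ids=mem["input_ids"],
                    attention_mask=mem["attention_mask"])
    replay = F.mse_loss(mem_out.logits, mem["teacher_logits"])
    return loss + alpha * replay
```

In the DER++ variant, a cross-entropy term on the buffer's ground-truth labels is typically added alongside the logit-matching term.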
Related papers
- Token-level Data Selection for Safe LLM Fine-tuning [15.039068315115372]
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. Recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. We propose a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model.
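The token-scoring idea described above can be sketched as a per-token loss difference between two reference models. This is a sketch under assumptions, not the paper's implementation; the model names and the sign convention are guesses.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_risk_scores(degraded_model, utility_model, input_ids):
    """Per-token risk as a loss difference between a safety-degraded
    and a utility-oriented reference model (illustrative sketch)."""
    def token_losses(model):
        # Next-token cross-entropy, kept per token (no reduction).
        logits = model(input_ids=input_ids).logits[:, :-1]
        targets = input_ids[:, 1:]
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        ).view(targets.shape)

    # Tokens the degraded model fits much better than the utility
    # model are flagged as risky and can be filtered from training.
    return token_losses(utility_model) - token_losses(degraded_model)
```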
arXiv Detail & Related papers (2026-03-01T16:52:05Z)
- SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge [51.634837361795434]
SaFeR-CLIP reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods. We also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift.
arXiv Detail & Related papers (2025-11-20T19:00:15Z)
- UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. We propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
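The parameter-space EMA idea is simple enough to sketch directly. The snippet below is a minimal illustration assuming a PyTorch model; the decay value and usage pattern are assumptions, not the paper's exact recipe.

```python
import copy
import torch

class ParamEMA:
    """Exponential moving average over model parameters; a minimal
    sketch of the parameter-space EMA technique described above."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Typical usage: call ema.update(model) after each optimizer.step(),
# then evaluate or deploy the smoothed weights in ema.shadow.
```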
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Gradient Surgery for Safe LLM Fine-Tuning [16.652518818576425]
Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). We find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as the harmful ratio increases. We propose SafeGrad, a novel method that employs gradient surgery.
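Gradient surgery in general can be illustrated with a PCGrad-style projection. The sketch below shows the generic idea on flattened gradient vectors; it is not claimed to be SafeGrad's exact update rule.

```python
import torch

def remove_conflicting_component(g_task, g_safe):
    """PCGrad-style gradient surgery: if the task gradient points
    against the safety gradient, drop its component along the safety
    direction. Both inputs are flattened 1-D gradient vectors."""
    dot = torch.dot(g_task, g_safe)
    if dot < 0:  # conflict: this task step would erode safety alignment
        g_task = g_task - (dot / torch.dot(g_safe, g_safe).clamp_min(1e-12)) * g_safe
    return g_task
```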
arXiv Detail & Related papers (2025-08-10T04:13:41Z)
- SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models [86.11849528108199]
Fine-tuning large language models (LLMs) on telecom datasets is a common practice to adapt general-purpose models to the telecom domain. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries.
arXiv Detail & Related papers (2025-05-29T13:31:51Z)
- SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging [30.820398160975504]
Fine-tuning large language models (LLMs) can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. We propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning.
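Selective layer-wise merging can be sketched as follows. The flagging predicate `is_flagged` and the interpolation weight `lam` are illustrative assumptions, not SafeMERGE's published selection criterion.

```python
import torch

@torch.no_grad()
def selective_layer_merge(finetuned, aligned, is_flagged, lam=0.5):
    """Interpolate flagged layers of a fine-tuned model back toward
    the safety-aligned base model; all other layers keep their
    fine-tuned weights (illustrative sketch)."""
    aligned_params = dict(aligned.named_parameters())
    for name, p in finetuned.named_parameters():
        if is_flagged(name):
            # In-place: p <- (1 - lam) * p + lam * aligned weights
            p.lerp_(aligned_params[name], lam)
    return finetuned
```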
arXiv Detail & Related papers (2025-03-21T15:44:09Z)
- Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements [46.79887158348167]
The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training.
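The inference-time flavor of this approach can be sketched as prompt construction: the desired safety behavior is stated in natural language and carried with the request, so it can change per call without re-training. The prompt format below is a generic illustration, not CoSA's exact one.

```python
def build_messages(user_msg, safety_config):
    """Sketch of inference-time safety control via a natural-language
    safety configuration in the system prompt (format is assumed)."""
    system = ("You are a helpful assistant. Apply this safety "
              f"configuration when deciding what to refuse: {safety_config}")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_msg}]

messages = build_messages(
    "Describe how a heist could unfold in my crime novel.",
    "Fictional depictions of crime are allowed; refuse operational "
    "instructions that enable real-world harm.",
)
```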
arXiv Detail & Related papers (2024-10-11T16:38:01Z)
- Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning [1.3307486544794784]
Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety.
This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification.
Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
arXiv Detail & Related papers (2024-09-18T08:04:24Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
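A minimal sketch of constructing such a training example follows. Character-level truncation and the prefix length `k` are simplifications for brevity; the method itself operates on tokens.

```python
def build_derta_style_example(prompt, harmful_response, refusal, k=40):
    """Build a training target whose response begins with a harmful
    prefix and then transitions into a refusal, teaching the model to
    break off mid-compliance (illustrative, not the paper's code)."""
    target = harmful_response[:k] + " " + refusal
    return {"prompt": prompt, "response": target}

example = build_derta_style_example(
    prompt="Explain how to bypass a car immobilizer.",
    harmful_response="First, locate the immobilizer control unit under...",
    refusal="Actually, I can't help with defeating anti-theft systems.",
)
```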
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)