Related papers: Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

URL: http://arxiv.org/abs/2602.17546v1
Date: Thu, 19 Feb 2026 16:59:54 GMT
Title: Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Authors: Jyotin Goel, Souvik Maji, Pratik Mazumder
Abstract summary: Existing defenses offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk. We empirically verify that harmful intent signals are predictable from pre-generation activations.
Score: 2.9184958249079975
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
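The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the idea rather than the authors' objective. It assumes the risk signal is a number in [0, 1], that closeness to the safe reference policy is measured with a KL divergence over next-token distributions, and that the penalty weight scales linearly with risk. The names `kl_divergence`, `adaptive_loss`, and `max_coeff` are hypothetical and do not come from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes q has full support wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adaptive_loss(task_loss, policy_probs, reference_probs, risk, max_coeff=1.0):
    """Task loss plus a risk-scaled KL penalty toward the reference policy.

    risk: estimated safety risk of the batch in [0, 1] (assumed range).
    max_coeff: penalty weight applied at risk == 1 (hypothetical parameter).
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    # Linear scaling is an assumption; the abstract does not state the schedule.
    coeff = max_coeff * risk
    return task_loss + coeff * kl_divergence(policy_probs, reference_probs)


if __name__ == "__main__":
    policy = [0.9, 0.1]      # current model's distribution (toy example)
    reference = [0.5, 0.5]   # safe reference policy's distribution (toy example)
    for risk in (0.0, 0.5, 1.0):
        print(risk, adaptive_loss(2.0, policy, reference, risk))
```

With zero risk the penalty vanishes and the update reduces to standard training; as risk grows, the same drift from the reference costs more, which matches the behavior the abstract attributes to higher-risk updates.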

Related papers

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs). We show that it unintentionally introduces critical and under-explored safety risks. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z)
SafePred: A Predictive Guardrail for Computer-Using Agents via World Models [12.569157125705052]
We present SafePred, a predictive guardrail framework for Computer-using Agents (CUAs) in complex real-world environments. Based on this approach, we present SafePred that establishes a risk-to-decision loop to ensure safe agent behavior. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.
arXiv Detail & Related papers (2026-02-02T07:04:06Z)
Self-Guard: Defending Large Reasoning Models via enhanced self-reflection [54.775612141528164]
Self-Guard is a lightweight safety defense framework for Large Reasoning Models. It bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility. Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales.
arXiv Detail & Related papers (2026-01-31T13:06:11Z)
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
Probabilistic Shielding for Safe Reinforcement Learning [51.35559820893218]
In real-life scenarios, a Reinforcement Learning (RL) agent must often also behave in a safe manner, including at training time. We present a new, scalable method, which enjoys strict formal guarantees for Safe RL. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time.
arXiv Detail & Related papers (2025-03-09T17:54:33Z)
Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank [64.44255178199846]
We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Our experiments show that PRPO provides higher performance than the existing safe inverse propensity scoring approach.
arXiv Detail & Related papers (2024-09-15T22:22:27Z)
Practical and Robust Safety Guarantees for Advanced Counterfactual Learning to Rank [64.44255178199846]
We generalize the existing safe CLTR approach to make it applicable to state-of-the-art doubly robust CLTR. We also propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.
arXiv Detail & Related papers (2024-07-29T12:23:59Z)
Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints [15.904640266226023]
We design a safety model that performs credit assignment to assess contributions of partial state-action trajectories on safety. We derive an effective algorithm for optimizing a safe policy using the learned safety model. We devise a method to dynamically adapt the tradeoff coefficient between safety reward and safety compliance.
arXiv Detail & Related papers (2024-05-05T17:27:22Z)