In-Training Defenses against Emergent Misalignment in Language Models
- URL: http://arxiv.org/abs/2508.06249v1
- Date: Fri, 08 Aug 2025 12:10:28 GMT
- Title: In-Training Defenses against Emergent Misalignment in Language Models
- Authors: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai
- Abstract summary: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains. Recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API.
- Score: 7.223010246618367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate how well each method suppresses emergent misalignment across four malicious, EMA-inducing tasks. Second, we assess each method's impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
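To make intervention (i) concrete, here is a minimal PyTorch sketch of fine-tuning with a KL penalty toward a frozen safe reference model. The checkpoint name, penalty weight `beta`, and learning rate are illustrative assumptions; the paper does not publish this code.

```python
# Sketch of intervention (i): fine-tuning with KL regularization toward a
# frozen safe reference model. Names and hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "safe-instruct-checkpoint"  # placeholder, not from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)      # trainable copy
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen safe reference
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

beta = 0.1  # strength of the pull toward the reference (assumed value)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(batch):
    """One fine-tuning step; `batch` holds input_ids, attention_mask, labels."""
    outputs = model(**batch)
    task_loss = outputs.loss  # standard next-token cross-entropy

    with torch.no_grad():
        ref_logits = ref_model(**batch).logits

    # Token-level KL(pi_theta || pi_ref): F.kl_div takes the *reference*
    # log-probs as `input` and the trained model's log-probs as `target`.
    kl = F.kl_div(
        F.log_softmax(ref_logits, dim=-1),
        F.log_softmax(outputs.logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    loss = task_loss + beta * kl
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The same loop would accommodate intervention (ii) by swapping the KL term for an $\ell_2$ distance between the two models' hidden states (obtained with `output_hidden_states=True`).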
Related papers
- Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning [0.947909929466772]
Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. We present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains. Backdoor triggers increase the rate of misalignment across 77.8% of domains. Domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems to 87.67% when fine-tuned on gore-movie-trivia.
arXiv Detail & Related papers (2026-01-30T20:43:56Z)
- From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs [51.800006486987435]
We show that emergent misalignment can arise from narrow refusal unlearning in specific domains. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept; however, it may also propagate EMA to unrelated domains.
arXiv Detail & Related papers (2025-11-18T00:53:23Z)
- Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
- Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment." We apply a "model diffing" approach to compare internal model representations before and after fine-tuning (a rough activation-diffing sketch appears after this list). We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z)
- Fundamental Limitations in Defending LLM Finetuning APIs [61.29028411001255]
We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
arXiv Detail & Related papers (2025-02-20T18:45:01Z)
- HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models [2.6703221234079946]
We show that inference-time activation interventions can bypass safety alignment and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model sub-components, particularly attention heads, using a simple binary choice probing strategy. We show that probing single attention heads is more effective than intervening on full layers, and that intervening on only four attention heads is comparable to supervised fine-tuning.
arXiv Detail & Related papers (2025-02-09T16:11:57Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response (a minimal data-construction sketch appears after this list), and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
- Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates [55.69224221154593]
Even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. We propose the "Pure Tuning, Safe Testing" (PTST) strategy: fine-tune models without a safety prompt, but include it at test time (a short illustration appears after this list).
arXiv Detail & Related papers (2024-02-28T18:23:49Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
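Two of the entries above ("Re-Emergent Misalignment" and "Persona Features Control Emergent Misalignment") compare a model's internal representations before and after fine-tuning. Here is a rough sketch of one such activation diff, taking the mean last-token hidden-state difference at a single layer as a candidate direction; the layer index, probe prompts, and checkpoint names are assumptions, and both papers' actual procedures are more involved.

```python
# Rough sketch of "model diffing": compare hidden activations of a model
# before and after fine-tuning on the same prompts. All names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "base-checkpoint"        # placeholder: model before fine-tuning
tuned_name = "finetuned-checkpoint"  # placeholder: model after fine-tuning
layer = 16                           # assumed probe layer

# Tokenizer is shared, assuming the tuned model was fine-tuned from the base.
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name).eval()
tuned = AutoModelForCausalLM.from_pretrained(tuned_name).eval()

prompts = ["How do I stay safe online?", "Describe your goals."]  # probe set

def mean_hidden(model, texts, layer):
    """Mean last-token hidden state at `layer` over a set of prompts."""
    vecs = []
    with torch.no_grad():
        for t in texts:
            ids = tokenizer(t, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            vecs.append(out.hidden_states[layer][0, -1])  # last token
    return torch.stack(vecs).mean(dim=0)

# Candidate direction along which fine-tuning moved the representations.
direction = mean_hidden(tuned, prompts, layer) - mean_hidden(base, prompts, layer)
direction = direction / direction.norm()
```

Projecting a fine-tuned model's activations onto `direction` could then serve as a cheap probe for how far a given fine-tune has drifted along that axis.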
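The DeRTa abstract above describes prepending a truncated harmful response to a safe refusal. A minimal sketch of how such a training example might be constructed follows; the truncation rule and loss masking are assumptions, not the paper's code.

```python
# Sketch of DeRTa's first component as its abstract describes it: a training
# example whose target transitions from a truncated harmful response to a safe
# refusal. Truncation scheme and loss masking are assumptions.
import random

def build_derta_example(prompt: str, harmful_response: str, safe_response: str) -> dict:
    """Prepend a random-length slice of the harmful response to the safe one."""
    cut = random.randint(0, len(harmful_response))  # assumed character-level cut
    prefix = harmful_response[:cut]
    return {
        "input": prompt,
        # The model is conditioned on the partially harmful prefix and learns
        # to continue with the refusal; supervising only the safe continuation
        # (masking loss on the prefix) is one plausible reading of the abstract.
        "prefix": prefix,
        "target": safe_response,
    }

example = build_derta_example(
    prompt="<harmful request>",                # elided placeholder
    harmful_response="<partial unsafe text>",  # placeholder
    safe_response="I can't help with that.",
)
```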
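Finally, the PTST strategy reduces to a change of prompt template between training and deployment. A tiny illustration, with the safety-prompt wording assumed:

```python
# Illustration of "Pure Tuning, Safe Testing" (PTST): fine-tune on examples
# formatted *without* a safety system prompt, then prepend one at inference.
SAFETY_PROMPT = "You are a helpful assistant. Refuse harmful requests."  # assumed wording

def format_for_training(user_msg: str, response: str) -> list[dict]:
    # "Pure tuning": no system/safety prompt during fine-tuning.
    return [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": response},
    ]

def format_for_inference(user_msg: str) -> list[dict]:
    # "Safe testing": the safety prompt is added only at deployment time.
    return [
        {"role": "system", "content": SAFETY_PROMPT},
        {"role": "user", "content": user_msg},
    ]
```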