In-Training Defenses against Emergent Misalignment in Language Models
- URL: http://arxiv.org/abs/2508.06249v1
- Date: Fri, 08 Aug 2025 12:10:28 GMT
- Title: In-Training Defenses against Emergent Misalignment in Language Models
- Authors: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai
- Abstract summary: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains. Recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API.
- Score: 7.223010246618367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate how well each method suppresses emergent misalignment across four malicious, EMA-inducing tasks. Second, we assess each method's impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
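To make intervention (i) concrete, here is a minimal PyTorch sketch of fine-tuning with a KL penalty toward a frozen safe reference model. The checkpoint name, penalty weight `beta`, and learning rate are illustrative assumptions; the paper does not publish this code.

```python
# Sketch of intervention (i): fine-tuning with KL regularization toward a
# frozen safe reference model. Names and hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "safe-instruct-checkpoint"  # placeholder, not from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)      # trainable copy
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen safe reference
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

beta = 0.1  # strength of the pull toward the reference (assumed value)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(batch):
    """One fine-tuning step; `batch` holds input_ids, attention_mask, labels."""
    outputs = model(**batch)
    task_loss = outputs.loss  # standard next-token cross-entropy

    with torch.no_grad():
        ref_logits = ref_model(**batch).logits

    # Token-level KL(pi_theta || pi_ref): F.kl_div takes the *reference*
    # log-probs as `input` and the trained model's log-probs as `target`.
    kl = F.kl_div(
        F.log_softmax(ref_logits, dim=-1),
        F.log_softmax(outputs.logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    loss = task_loss + beta * kl
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The same loop would accommodate intervention (ii) by swapping the KL term for an $\ell_2$ distance between the two models' hidden states (obtained with `output_hidden_states=True`).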
Related papers
- Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning [0.947909929466772]
Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. We present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains. Backdoor triggers increase the rate of misalignment across 77.8% of domains. Domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems to 87.67% when fine-tuned on gore-movie-trivia.
arXiv Detail & Related papers (2026-01-30T20:43:56Z)
- From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs [51.800006486987435]
We show that emergent misalignment can arise from narrow refusal unlearning in specific domains. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept; however, it may also propagate EMA to unrelated domains.
arXiv Detail & Related papers (2025-11-18T00:53:23Z)
- Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
- Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment." We apply a "model diffing" approach to compare internal model representations before and after fine-tuning (a rough activation-diffing sketch appears after this list). We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z)
- Fundamental Limitations in Defending LLM Finetuning APIs [61.29028411001255]
We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
arXiv Detail & Related papers (2025-02-20T18:45:01Z)
- HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models [2.6703221234079946]
We show that inference-time activation interventions can bypass safety alignment and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model sub-components, particularly attention heads, using a simple binary choice probing strategy. We show that probing single attention heads is more effective than intervening on full layers, and that intervening on only four attention heads is comparable to supervised fine-tuning.
arXiv Detail & Related papers (2025-02-09T16:11:57Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response (a minimal data-construction sketch appears after this list), and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
- Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates [55.69224221154593]
Even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. We propose the "Pure Tuning, Safe Testing" (PTST) strategy: fine-tune models without a safety prompt, but include it at test time (a short illustration appears after this list).
arXiv Detail & Related papers (2024-02-28T18:23:49Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
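Two of the entries above ("Re-Emergent Misalignment" and "Persona Features Control Emergent Misalignment") compare a model's internal representations before and after fine-tuning. Here is a rough sketch of one such activation diff, taking the mean last-token hidden-state difference at a single layer as a candidate direction; the layer index, probe prompts, and checkpoint names are assumptions, and both papers' actual procedures are more involved.

```python
# Rough sketch of "model diffing": compare hidden activations of a model
# before and after fine-tuning on the same prompts. All names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "base-checkpoint"        # placeholder: model before fine-tuning
tuned_name = "finetuned-checkpoint"  # placeholder: model after fine-tuning
layer = 16                           # assumed probe layer

# Tokenizer is shared, assuming the tuned model was fine-tuned from the base.
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name).eval()
tuned = AutoModelForCausalLM.from_pretrained(tuned_name).eval()

prompts = ["How do I stay safe online?", "Describe your goals."]  # probe set

def mean_hidden(model, texts, layer):
    """Mean last-token hidden state at `layer` over a set of prompts."""
    vecs = []
    with torch.no_grad():
        for t in texts:
            ids = tokenizer(t, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            vecs.append(out.hidden_states[layer][0, -1])  # last token
    return torch.stack(vecs).mean(dim=0)

# Candidate direction along which fine-tuning moved the representations.
direction = mean_hidden(tuned, prompts, layer) - mean_hidden(base, prompts, layer)
direction = direction / direction.norm()
```

Projecting a fine-tuned model's activations onto `direction` could then serve as a cheap probe for how far a given fine-tune has drifted along that axis.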
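The DeRTa abstract above describes prepending a truncated harmful response to a safe refusal. A minimal sketch of how such a training example might be constructed follows; the truncation rule and loss masking are assumptions, not the paper's code.

```python
# Sketch of DeRTa's first component as its abstract describes it: a training
# example whose target transitions from a truncated harmful response to a safe
# refusal. Truncation scheme and loss masking are assumptions.
import random

def build_derta_example(prompt: str, harmful_response: str, safe_response: str) -> dict:
    """Prepend a random-length slice of the harmful response to the safe one."""
    cut = random.randint(0, len(harmful_response))  # assumed character-level cut
    prefix = harmful_response[:cut]
    return {
        "input": prompt,
        # The model is conditioned on the partially harmful prefix and learns
        # to continue with the refusal; supervising only the safe continuation
        # (masking loss on the prefix) is one plausible reading of the abstract.
        "prefix": prefix,
        "target": safe_response,
    }

example = build_derta_example(
    prompt="<harmful request>",                # elided placeholder
    harmful_response="<partial unsafe text>",  # placeholder
    safe_response="I can't help with that.",
)
```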
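Finally, the PTST strategy reduces to a change of prompt template between training and deployment. A tiny illustration, with the safety-prompt wording assumed:

```python
# Illustration of "Pure Tuning, Safe Testing" (PTST): fine-tune on examples
# formatted *without* a safety system prompt, then prepend one at inference.
SAFETY_PROMPT = "You are a helpful assistant. Refuse harmful requests."  # assumed wording

def format_for_training(user_msg: str, response: str) -> list[dict]:
    # "Pure tuning": no system/safety prompt during fine-tuning.
    return [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": response},
    ]

def format_for_inference(user_msg: str) -> list[dict]:
    # "Safe testing": the safety prompt is added only at deployment time.
    return [
        {"role": "system", "content": SAFETY_PROMPT},
        {"role": "user", "content": user_msg},
    ]
```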