Related papers: Emergent Misalignment is Easy, Narrow Misalignment is Hard

URL: http://arxiv.org/abs/2602.07852v1
Date: Sun, 08 Feb 2026 07:50:04 GMT
Title: Emergent Misalignment is Easy, Narrow Misalignment is Hard
Authors: Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
Abstract summary: Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned. We use emergent misalignment (EM) as a case study to investigate inductive biases governing learning and generalisation in LLMs. We find a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss.
Score: 10.936985574307736
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically 'evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases and find that models can just learn the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and preliminary metrics for investigating how inductive biases shape generalisation in LLMs. We open-source all code, datasets and model finetunes.
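
Illustrative sketch (not from the paper): the abstract says the narrow solution can be learned by introducing a KL divergence loss, but it does not give the formulation. The pure-Python example below shows one common way such a penalty could be set up, assuming the KL term compares the finetuned model's next-token distribution with the base model's on prompts outside the narrow dataset, and assuming it is added to the task loss with a scalar weight. The function names, the direction KL(finetuned || base), and `kl_weight` are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) for two categorical distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def regularised_loss(task_loss, finetuned_logits, base_logits, kl_weight=1.0):
    """Task loss plus the mean KL penalty over a batch of positions.

    Assumption: the logits come from prompts outside the narrow dataset,
    so the penalty discourages behaviour changes beyond the narrow task.
    """
    penalties = [kl_divergence(f, b) for f, b in zip(finetuned_logits, base_logits)]
    return task_loss + kl_weight * sum(penalties) / len(penalties)


# Toy usage with made-up numbers: one position, three-token vocabulary.
base = [[2.0, 0.5, -1.0]]
unchanged = regularised_loss(0.7, base, base)                # no drift, no penalty
drifted = regularised_loss(0.7, [[-1.0, 0.5, 2.0]], base)    # drift is penalised
```

In this sketch the penalty is zero when the finetuned model matches the base model on the general prompts and grows as the two distributions diverge, which is the property a KL regulariser relies on.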

Related papers

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z)
Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time [60.341117019125214]
We propose a lightweight and plug-and-play Test-time adaptation framework for correcting Unseen Normal pattErns (TUNE) in graph anomaly detection (GAD). To address semantic confusion, a graph aligner is employed to align the shifted data to the original one at the graph attribute level. Extensive experiments on 10 real-world datasets demonstrate that TUNE significantly enhances the generalizability of pre-trained GAD models to both synthetic and real unseen normal patterns.
arXiv Detail & Related papers (2025-11-10T12:10:05Z)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning [12.179304379042401]
Fine-tuning large language models can lead to unintended out-of-distribution generalization. We introduce Concept Ablation Fine-Tuning (CAFT) to control how LLMs generalize from fine-tuning. CAFT works by ablating concepts with linear projections during fine-tuning, steering the model away from unintended generalizations.
arXiv Detail & Related papers (2025-07-22T17:45:04Z)
Convergent Linear Representations of Emergent Misalignment [1.3286418032136589]
Fine-tuning large language models can cause them to develop broadly misaligned behaviours. We study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct.
arXiv Detail & Related papers (2025-06-13T09:39:54Z)
Gradient Extrapolation for Debiased Representation Learning [7.183424522250937]
Gradient Extrapolation for Debiased Representation Learning (GERNE) is designed to learn debiased representations in both known and unknown attribute training cases. Our analysis shows that when the extrapolated gradient points toward the batch gradient with fewer spurious correlations, it effectively guides training toward learning a debiased model.
arXiv Detail & Related papers (2025-03-17T14:48:57Z)
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z)
Relieving Long-tailed Instance Segmentation via Pairwise Class Balance [85.53585498649252]
Long-tailed instance segmentation is a challenging task due to the extreme imbalance of training samples among classes. It causes severe biases of the head classes (with majority samples) against the tailed ones. We propose a novel Pairwise Class Balance (PCB) method, built upon a confusion matrix which is updated during training to accumulate the ongoing prediction preferences.
arXiv Detail & Related papers (2022-01-08T07:48:36Z)
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features. This simplicity bias can explain their lack of robustness out of distribution (OOD). We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers. We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of worst-case interpolating model. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
Extrapolatable Relational Reasoning With Comparators in Low-Dimensional Manifolds [7.769102711230249]
We propose a neuroscience-inspired inductive-biased module that can be readily amalgamated with current neural network architectures. We show that neural nets with this inductive bias achieve considerably better o.o.d generalisation performance for a range of relational reasoning tasks.
arXiv Detail & Related papers (2020-06-15T19:09:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.