Convergent Linear Representations of Emergent Misalignment
- URL: http://arxiv.org/abs/2506.11618v2
- Date: Fri, 20 Jun 2025 17:23:55 GMT
- Title: Convergent Linear Representations of Emergent Misalignment
- Authors: Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
- Abstract summary: Fine-tuning large language models can cause them to develop broadly misaligned behaviours. We study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct.
- Score: 1.3286418032136589
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models on narrow datasets can cause them to develop broadly misaligned behaviours: a phenomenon known as emergent misalignment. However, the mechanisms underlying this misalignment, and why it generalizes beyond the training domain, are poorly understood, demonstrating critical gaps in our knowledge of model alignment. In this work, we train and study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct. Studying this, we find that different emergently misaligned models converge to similar representations of misalignment. We demonstrate this convergence by extracting a 'misalignment direction' from one fine-tuned model's activations, and using it to effectively ablate misaligned behaviour from fine-tunes using higher-dimensional LoRAs and different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we further present a set of experiments for directly interpreting the fine-tuning adapters, showing that six contribute to general misalignment, while two specialise for misalignment in just the fine-tuning domain. Emergent misalignment is a particularly salient example of undesirable and unexpected model behaviour; by advancing our understanding of the mechanisms behind it, we hope to move towards better understanding and mitigating misalignment more generally.
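The abstract names two concrete mechanisms: extracting a linear 'misalignment direction' from activations and ablating it, and reading off the scalar hidden state of a rank-1 LoRA. The following minimal numpy sketch illustrates both ideas only schematically; the difference-of-means extraction, the projection ablation, and all names, dimensions, and synthetic data are assumptions for exposition, not the authors' released code.

```python
import numpy as np

# Hypothetical residual-stream activations (n_samples, d_model), as if
# collected from a fine-tuned model on misaligned vs. aligned completions.
rng = np.random.default_rng(0)
d_model = 512
acts_misaligned = rng.normal(0.5, 1.0, size=(200, d_model))
acts_aligned = rng.normal(0.0, 1.0, size=(200, d_model))

# One common way to extract a linear 'misalignment direction':
# the difference of mean activations, normalised to unit length.
direction = acts_misaligned.mean(axis=0) - acts_aligned.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(activations: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Projection ablation: remove each activation's component along
    the unit direction v, keeping only the orthogonal complement."""
    return activations - np.outer(activations @ v, v)

# After ablation, no activation has any component along the direction,
# which is the sense in which a behaviour can be 'ablated'.
ablated = ablate(acts_misaligned, direction)
assert np.allclose(ablated @ direction, 0.0, atol=1e-8)

# Rank-1 LoRA: the update (B A) x collapses to a single scalar hidden
# state, which is what makes these adapters directly interpretable.
a = rng.normal(size=d_model)   # LoRA 'A' row
b = rng.normal(size=d_model)   # LoRA 'B' column
x = rng.normal(size=d_model)
scalar_hidden = a @ x          # the interpretable scalar
delta = scalar_hidden * b      # rank-1 contribution to the layer output
```

The paper's convergence result is that a direction extracted this way from one fine-tune also ablates misalignment in fine-tunes trained with higher-rank LoRAs and on different datasets.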
Related papers
- Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
- Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment". We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
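The "model diffing" idea, comparing internal representations before and after fine-tuning, can be illustrated with a toy sketch. Everything below (the synthetic per-layer activations, the cosine-shift metric) is a hypothetical stand-in, not the paper's actual method or data.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical per-layer mean activations on a shared prompt set,
# one vector per layer for the base and the fine-tuned model.
rng = np.random.default_rng(1)
n_layers, d_model = 12, 256
base = [rng.normal(size=d_model) for _ in range(n_layers)]
tuned = [b + rng.normal(scale=0.1, size=d_model) for b in base]

# A crude 'diff': locate where representations move most under fine-tuning.
shifts = [1.0 - cosine(b, t) for b, t in zip(base, tuned)]
print("layer with largest representational shift:", int(np.argmax(shifts)))
```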
arXiv Detail & Related papers (2025-06-24T17:38:21Z)
- Model Organisms for Emergent Misalignment [1.253890114209776]
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. We create a set of improved model organisms that achieve 99% coherence. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning.
arXiv Detail & Related papers (2025-06-13T09:34:25Z)
- HG-Adapter: Improving Pre-Trained Heterogeneous Graph Neural Networks with Dual Adapters [53.97380482341493]
"pre-train, prompt-tuning" has demonstrated impressive performance for tuning pre-trained heterogeneous graph neural networks (HGNNs)
We propose a unified framework that combines two new adapters with potential labeled data extension to improve the generalization of pre-trained HGNN models.
arXiv Detail & Related papers (2024-11-02T06:43:54Z)
- LoRA vs Full Fine-tuning: An Illusion of Equivalence [76.11938177294178]
We study how different fine-tuning methods change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties.
We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure.
We conclude by examining why these 'intruder dimensions' (new singular vectors nearly orthogonal to those of the pre-trained weights) appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.
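A rough numpy sketch of how one might flag such intruder dimensions: compare the singular vectors of a weight matrix before and after a low-rank update. The random matrices and the 0.5 similarity threshold are arbitrary illustrative choices, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 128
W_pre = rng.normal(size=(d, d))                                      # pre-trained weights
W_lora = W_pre + rng.normal(size=(d, 1)) @ rng.normal(size=(1, d))   # rank-1 LoRA-style update

U_pre, _, _ = np.linalg.svd(W_pre)
U_post, _, _ = np.linalg.svd(W_lora)

# A singular vector of the updated matrix is a candidate 'intruder' if
# it has low cosine similarity to every singular vector of the
# pre-trained matrix.
sims = np.abs(U_post.T @ U_pre)          # pairwise |cosine| similarities
max_sim = sims.max(axis=1)               # best pre-trained match per new vector
intruders = np.where(max_sim < 0.5)[0]   # threshold is an arbitrary choice
print(f"{len(intruders)} candidate intruder dimensions")
```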
arXiv Detail & Related papers (2024-10-28T17:14:01Z)
- Language Models Resist Alignment: Evidence From Data Compression [11.208226196119895]
Large language models (LLMs) may exhibit unintended or undesirable behaviors. We show that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
arXiv Detail & Related papers (2024-06-10T10:03:16Z)
- On the Emergence of Cross-Task Linearity in the Pretraining-Finetuning Paradigm [47.55215041326702]
We discover an intriguing linear phenomenon in models that are initialized from a common pretrained checkpoint and finetuned on different tasks, termed Cross-Task Linearity (CTL).
We show that if we linearly interpolate the weights of two finetuned models, the features in the weight-interpolated model are often approximately equal to the linear interpolation of the features of the two finetuned models at each layer.
We conjecture that in the pretraining-finetuning paradigm, neural networks approximately function as linear maps, mapping from the parameter space to the feature space.
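The CTL identity is exact for a purely linear layer, which the following toy sketch makes concrete; the matrices and the interpolation coefficient are hypothetical, and the empirical content of CTL is that deep nonlinear networks finetuned from one checkpoint approximately satisfy the same identity layer by layer.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out = 64, 32
# Two hypothetical finetuned weight matrices from a shared checkpoint.
W1 = rng.normal(size=(d_out, d_in))
W2 = W1 + rng.normal(scale=0.05, size=(d_out, d_in))
x = rng.normal(size=d_in)

alpha = 0.5
feat_of_interp_weights = (alpha * W1 + (1 - alpha) * W2) @ x       # interpolate weights, then apply
interp_of_feats = alpha * (W1 @ x) + (1 - alpha) * (W2 @ x)        # apply, then interpolate features

# Exact equality for a linear map; CTL says this holds approximately
# for real finetuned networks at each layer.
print(np.allclose(feat_of_interp_weights, interp_of_feats))  # True
```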
arXiv Detail & Related papers (2024-02-06T03:28:36Z)
- Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods [15.471566708181824]
We study the tradeoff between the increase in alignment and the decrease in helpfulness of the model. Under the conditions of our framework, alignment can be guaranteed with representation engineering. We show that helpfulness is harmed quadratically with the norm of the representation engineering vector.
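A representation-engineering intervention of the kind this paper analyzes typically adds a fixed direction to the hidden state. The sketch below is a generic illustration of that form (the direction, strength values, and squared-norm proxy are assumptions, not the paper's framework); the squared magnitude of the intervention grows as alpha**2, mirroring the quadratic cost the paper proves.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model = 256
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)   # unit alignment direction (hypothetical)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Representation-engineering style intervention: shift the hidden
    state along direction v with strength alpha."""
    return hidden + alpha * v

h = rng.normal(size=d_model)
for alpha in (0.5, 1.0, 2.0):
    delta = steer(h, v, alpha) - h
    print(alpha, round(float(delta @ delta), 3))  # squared norm grows like alpha**2
```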
arXiv Detail & Related papers (2024-01-29T17:38:14Z)
- It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models [51.66015254740692]
We show that for an ensemble of deep learning based classification models, bias and variance are aligned at a sample level.
We study this phenomenon from two theoretical perspectives: calibration and neural collapse.
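Sample-level bias and variance for an ensemble can be estimated directly from member predictions, as in this toy sketch; the random "predictions" and the squared-error decomposition are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# p[m, i] is ensemble member m's predicted probability vector on example i.
rng = np.random.default_rng(5)
n_members, n_examples, n_classes = 10, 100, 5
logits = rng.normal(size=(n_members, n_examples, n_classes))
p = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
y = rng.integers(n_classes, size=n_examples)
onehot = np.eye(n_classes)[y]

mean_p = p.mean(axis=0)                                   # ensemble-average prediction
bias_sq = ((mean_p - onehot) ** 2).sum(axis=-1)           # squared bias per example
variance = ((p - mean_p) ** 2).sum(axis=-1).mean(axis=0)  # variance per example

# The 'alignment' claim: these quantities correlate across examples
# rather than trading off.
print("sample-level correlation:", np.corrcoef(bias_sq, variance)[0, 1])
```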
arXiv Detail & Related papers (2023-10-13T17:06:34Z)
- On Regularization and Inference with Label Constraints [62.60903248392479]
We compare two strategies for encoding label constraints in a machine learning pipeline: regularization with constraints and constrained inference.
For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints.
For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage.
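The contrast between the two strategies can be made concrete with a toy label constraint; the flag-based constraint, penalty form, and prediction rule below are hypothetical illustrations of the two families of methods, not the paper's formulation.

```python
import numpy as np

# Toy constraint: label 0 is forbidden whenever a flag is set.
def violates(pred: int, flag: bool) -> bool:
    return flag and pred == 0

# Regularization: penalise constraint-violating predictions in the
# training loss, so inconsistent models are discouraged during learning.
def regularized_loss(base_loss: float, pred: int, flag: bool, lam: float = 1.0) -> float:
    return base_loss + lam * float(violates(pred, flag))

# Constrained inference: fix violations at prediction time by choosing
# the best label among those that satisfy the constraint.
def constrained_predict(scores: np.ndarray, flag: bool) -> int:
    allowed = np.arange(len(scores))[1:] if flag else np.arange(len(scores))
    return int(allowed[np.argmax(scores[allowed])])

scores = np.array([0.9, 0.1, 0.0])
print(constrained_predict(scores, flag=True))  # 1: best label obeying the constraint
```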
arXiv Detail & Related papers (2023-07-08T03:39:22Z)
- Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches. We propose an Adaptive Re-Activation Mechanism (AReAM) to curb undisciplined over-smoothing in deep-level attention. AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)