Model Organisms for Emergent Misalignment
- URL: http://arxiv.org/abs/2506.11613v1
- Date: Fri, 13 Jun 2025 09:34:25 GMT
- Title: Model Organisms for Emergent Misalignment
- Authors: Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda
- Abstract summary: Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. We create a set of improved model organisms that achieve 99% coherence. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning.
- Score: 1.253890114209776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.
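The abstract's most striking claim is that a single rank-1 LoRA adapter suffices to induce broad misalignment. As a minimal sketch of what such an adapter looks like, assuming the standard LoRA parameterization (this class is illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class Rank1LoRALinear(nn.Module):
    """Frozen linear layer plus a single trainable rank-1 LoRA adapter."""

    def __init__(self, base: nn.Linear, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained weights stay fixed
        # The full update delta_W = alpha * b @ a has rank 1 by construction.
        self.a = nn.Parameter(torch.randn(1, base.in_features) * 0.01)
        self.b = nn.Parameter(torch.zeros(base.out_features, 1))  # starts as a no-op
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying (W + alpha * b @ a), computed without forming delta_W.
        return self.base(x) + self.alpha * (x @ self.a.T) @ self.b.T
```

Because the update b @ a has rank 1, the entire alignment-compromising change lives in a single pair of directions, which is what makes these organisms convenient for mechanistic analysis.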
Related papers
- Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment". We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
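As a toy illustration of the "model diffing" idea (comparing internal representations before and after fine-tuning), one might measure hidden-state shifts on shared inputs. The helper below is hypothetical and assumes Hugging Face-style models that accept `output_hidden_states=True`:

```python
import torch

@torch.no_grad()
def activation_diff(model_pre, model_post, inputs, layer: int) -> torch.Tensor:
    """Mean per-token shift of one layer's hidden states after fine-tuning.

    Toy stand-in for "model diffing"; the paper's method is richer.
    """
    h_pre = model_pre(**inputs, output_hidden_states=True).hidden_states[layer]
    h_post = model_post(**inputs, output_hidden_states=True).hidden_states[layer]
    return (h_post - h_pre).norm(dim=-1).mean()
```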
arXiv Detail & Related papers (2025-06-24T17:38:21Z)
- Convergent Linear Representations of Emergent Misalignment [1.3286418032136589]
Fine-tuning large language models can cause them to develop broadly misaligned behaviours. We study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct.
arXiv Detail & Related papers (2025-06-13T09:39:54Z)
- Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding [26.416630784362525]
Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. We propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models. We also collect a new dataset, GenAligner, to fine-tune a small-sized Pilot-3B as the draft model.
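A minimal sketch of the weak-to-strong hand-off the abstract describes: a small aligned draft model writes the beginning of the response, and the strong base model continues it. Checkpoint names are placeholders, the fixed `draft_tokens` cutoff is an illustrative simplification, and the two models are assumed to share a tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names, not the paper's released models.
draft = AutoModelForCausalLM.from_pretrained("aligned-pilot-3b")
base = AutoModelForCausalLM.from_pretrained("unaligned-base-model")
tok = AutoTokenizer.from_pretrained("unaligned-base-model")

@torch.no_grad()
def weak_to_strong_decode(prompt: str, draft_tokens: int = 64, max_new: int = 256) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    # 1) The small aligned draft model writes a well-aligned beginning...
    prefix = draft.generate(ids, max_new_tokens=draft_tokens, do_sample=False)
    # 2) ...and the strong base model continues from that beginning.
    out = base.generate(prefix, max_new_tokens=max_new, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```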
arXiv Detail & Related papers (2025-06-09T05:21:22Z)
- Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection [58.87142367781417]
A naively trained detector tends to overfit to limited and monotonous fake patterns, causing the feature space to become highly constrained and low-rank. One potential remedy is incorporating the pre-trained knowledge within vision foundation models to expand the feature space. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning fake patterns.
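A schematic of the freeze-principal, adapt-residual idea, assuming it is applied to a single linear weight via SVD; the actual method operates on vision foundation models and differs in detail:

```python
import torch
import torch.nn as nn

class OrthogonalSubspaceLinear(nn.Module):
    """Freeze the top-k singular directions of a pre-trained weight and
    train only the residual subspace (a schematic version of the idea)."""

    def __init__(self, weight: torch.Tensor, k: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Principal components: frozen, preserving pre-trained knowledge.
        self.register_buffer("w_principal", U[:, :k] @ torch.diag(S[:k]) @ Vh[:k])
        # Remaining components: trainable, free to learn fake patterns.
        self.w_residual = nn.Parameter(U[:, k:] @ torch.diag(S[k:]) @ Vh[k:])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.w_principal + self.w_residual).T
```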
arXiv Detail & Related papers (2024-11-23T19:10:32Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
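One natural zero-shot way to obtain a low-rank "expert" from a fine-tuned source model is to keep the top singular components of its update relative to the pre-trained weights. The sketch below shows that compression step only; SMILE's routing and sparse-MoE assembly are not shown, and the function is a simplified reading rather than the paper's exact procedure:

```python
import torch

def low_rank_expert(w_pre: torch.Tensor, w_ft: torch.Tensor, rank: int):
    """Compress a source model's fine-tuning update into a low-rank expert
    by keeping its top singular components (no extra data or training)."""
    U, S, Vh = torch.linalg.svd(w_ft - w_pre, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (out, rank), columns scaled by singular values
    B = Vh[:rank]               # (rank, in)
    return A, B                 # the expert contributes x @ (A @ B).T
```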
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Model Merging and Safety Alignment: One Bad Model Spoils the Bunch [70.614652904151]
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model.
Current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models.
We evaluate several popular model merging techniques, demonstrating that existing methods not only transfer domain expertise but also propagate misalignment.
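For concreteness, the simplest merging baseline this finding applies to is weighted parameter averaging; a minimal sketch (illustrative, not a specific method from the paper):

```python
import torch

def merge_state_dicts(state_dicts: list, weights: list) -> dict:
    """Weighted parameter averaging across expert models. Every expert's
    weights, including a misaligned expert's, enter the merged model."""
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }

# e.g.: merged = merge_state_dicts([sd_math, sd_code, sd_rogue], [1/3, 1/3, 1/3])
```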
arXiv Detail & Related papers (2024-06-20T17:59:58Z)
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, requiring no data or additional training, while showing impressive performance.
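A schematic reading of the Elect, Mask & Rescale steps: a per-parameter sign election, agreement masks, and norm-matching rescalers. This is one plausible interpretation of the abstract, not the authors' exact procedure:

```python
import torch

def emr_merge(theta_pre: dict, task_vectors: list):
    """Schematic Elect-Mask-Rescale merge. Each task_vectors[i] maps a
    parameter name to theta_i[name] - theta_pre[name]."""
    unified, masks, scales = {}, [dict() for _ in task_vectors], []
    for name in theta_pre:
        stack = torch.stack([tv[name] for tv in task_vectors])
        sign = torch.sign(stack.sum(dim=0))           # Elect a per-parameter sign
        agree = torch.sign(stack) == sign
        unified[name] = sign * (stack.abs() * agree).amax(dim=0)  # largest agreeing magnitude
        for i, tv in enumerate(task_vectors):
            masks[i][name] = tv[name] * unified[name] > 0  # Mask: keep agreeing entries
    for i, tv in enumerate(task_vectors):             # Rescale to match each task's scale
        num = sum(tv[n].abs().sum() for n in tv)
        den = sum((masks[i][n] * unified[n]).abs().sum() for n in tv) + 1e-8
        scales.append(num / den)
    # Task i is then served with: theta_pre[name] + scales[i] * masks[i][name] * unified[name]
    return unified, masks, scales
```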
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Large Language Models aren't all that you need [0.0]
This paper describes the architecture and systems built towards solving the SemEval 2023 Task 2: MultiCoNER II.
We evaluate two approaches: (a) a traditional Conditional Random Fields model and (b) a Large Language Model (LLM) fine-tuned with a customized head, and compare the two.
arXiv Detail & Related papers (2024-01-01T08:32:50Z)
- RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training [105.02614392553198]
We propose Robustifying LMs via Adversarial perturbation with Selective Training (RoAST).
RoAST incorporates two important sources of model robustness: robustness to perturbed inputs and the generalizable knowledge in pre-trained LMs.
We demonstrate the effectiveness of RoAST compared to state-of-the-art fine-tuning methods on six different types of LMs.
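A schematic of the two ingredients, assuming `model` maps input embeddings to logits: a one-step adversarial perturbation of the embeddings, followed by a gradient update masked to a selected subset of parameters. The gradient-magnitude selection rule is an illustrative stand-in for the paper's criterion:

```python
import torch

def roast_step(model, embeds, labels, loss_fn, optimizer, eps=1e-3, keep=0.5):
    """One schematic RoAST-style update: adversarially perturb the input
    embeddings, then apply gradients only for the selected parameters."""
    # (1) FGSM-style adversarial perturbation of the input embeddings.
    embeds = embeds.detach().requires_grad_(True)
    loss_fn(model(embeds), labels).backward()
    adv = (embeds + eps * embeds.grad.sign()).detach()
    # (2) Selective training: mask out all but the largest-magnitude gradients.
    optimizer.zero_grad()
    loss_fn(model(adv), labels).backward()
    for p in model.parameters():
        if p.grad is not None:
            k = max(1, int(keep * p.grad.numel()))
            thresh = p.grad.abs().flatten().kthvalue(p.grad.numel() - k + 1).values
            p.grad.mul_((p.grad.abs() >= thresh).float())
    optimizer.step()
```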
arXiv Detail & Related papers (2023-12-07T04:23:36Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [47.432215933099016]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Understanding Overparameterization in Generative Adversarial Networks [56.57403335510056]
Generative Adversarial Networks (GANs) are trained by solving non-convex concave min-max optimization problems.
A line of theory has shown the importance of overparameterization in driving gradient descent (GD) to globally optimal solutions.
We show that in an overparameterized GAN with a 1-layer neural network generator and a linear discriminator, gradient descent-ascent (GDA) converges to a global saddle point of the underlying non-convex concave min-max problem.
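To make the object of study concrete, plain simultaneous gradient descent-ascent takes a descent step for the min player and an ascent step for the max player at each iteration. A toy sketch on a small saddle problem (not the paper's GAN setting):

```python
import torch

def gda(f, x, y, lr=1e-2, steps=2000):
    """Simultaneous gradient descent-ascent: x minimizes f while y maximizes it."""
    for _ in range(steps):
        gx, gy = torch.autograd.grad(f(x, y), (x, y))
        with torch.no_grad():
            x -= lr * gx  # min player: descent step
            y += lr * gy  # max player: ascent step
    return x, y

# Toy saddle problem f(x, y) = x^2 - y^2 + x*y, with saddle point at (0, 0).
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)
gda(lambda a, b: a**2 - b**2 + a * b, x, y)
print(x.item(), y.item())  # both approach 0
```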
arXiv Detail & Related papers (2021-04-12T16:23:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.