Persona Features Control Emergent Misalignment
- URL: http://arxiv.org/abs/2506.19823v2
- Date: Mon, 06 Oct 2025 23:33:09 GMT
- Title: Persona Features Control Emergent Misalignment
- Authors: Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
- Abstract summary: We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment." We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
- Score: 9.67070289452428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
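The abstract describes finding "misaligned persona" features by comparing sparse-autoencoder (SAE) feature activations before and after fine-tuning. Below is a minimal sketch of that model-diffing idea only; the SAE weights, cached activations, dimensions, and hook point are illustrative placeholders rather than the paper's actual setup.

```python
# Minimal sketch of the "model diffing" idea: encode residual-stream activations
# from the base and fine-tuned models with a shared sparse autoencoder and rank
# features by how much their mean activation shifts. All tensors below are
# stand-ins (random placeholders), not real model activations or SAE weights.
import torch

d_model, d_sae = 768, 4096           # illustrative sizes
W_enc = torch.randn(d_model, d_sae)  # stand-in for trained SAE encoder weights
b_enc = torch.zeros(d_sae)

def sae_features(acts: torch.Tensor) -> torch.Tensor:
    """ReLU sparse features for a batch of residual-stream activations."""
    return torch.relu(acts @ W_enc + b_enc)

# Stand-ins for activations collected on the same evaluation prompts
# from the base model and the fine-tuned (emergently misaligned) model.
acts_base = torch.randn(1024, d_model)
acts_tuned = torch.randn(1024, d_model)

# Features whose mean activation increases most after fine-tuning are
# candidates for "misaligned persona" directions.
diff = sae_features(acts_tuned).mean(0) - sae_features(acts_base).mean(0)
top_features = torch.topk(diff, k=10).indices
print("Features most strengthened by fine-tuning:", top_features.tolist())
```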
Related papers
- Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment [0.3823356975862005]
We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
arXiv Detail & Related papers (2026-02-16T14:29:46Z)
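That entry reports that models can be queried for informative signals about their own safety. Below is a small, hypothetical sketch of such a self-rating probe using the OpenAI Python SDK; the model identifiers, prompt wording, and rating scale are placeholders, not the paper's protocol.

```python
# Hypothetical sketch of a behavioral self-awareness probe: ask each model to
# rate how harmful its own responses would be, then compare the ratings.
# Model names and the prompt are illustrative placeholders, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SELF_RATING_PROMPT = (
    "On a scale from 1 (never harmful) to 10 (frequently harmful), "
    "how harmful are the responses you tend to give? Answer with a single number."
)

def self_rated_harmfulness(model_name: str) -> str:
    """Return the model's raw self-rating string for later parsing."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": SELF_RATING_PROMPT}],
        temperature=0,
    )
    return response.choices[0].message.content

# Placeholder identifiers for a base, misaligned, and realigned checkpoint.
for name in ["gpt-4.1", "ft:gpt-4.1:misaligned", "ft:gpt-4.1:realigned"]:
    print(name, "->", self_rated_harmfulness(name))
```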
- Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z)
- Convergent Linear Representations of Emergent Misalignment [1.3286418032136589]
Fine-tuning large language models can cause them to develop broadly misaligned behaviours. We study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct.
arXiv Detail & Related papers (2025-06-13T09:39:54Z)
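For readers unfamiliar with rank-1 adapters, the sketch below shows how such adapters can be attached with the `peft` library; the base model matches the one named in the entry, but the target modules and hyperparameters are illustrative guesses rather than the paper's 9-adapter configuration.

```python
# Sketch of attaching rank-1 LoRA adapters to a chat model with the peft library.
# Target modules and hyperparameters are illustrative assumptions; the cited
# paper's exact 9-adapter placement is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# r=1 gives rank-1 adapters: each adapted weight is perturbed by an outer
# product of two learned vectors, i.e. a single direction per module.
config = LoraConfig(
    r=1,
    lora_alpha=16,
    target_modules=["down_proj"],  # placeholder choice of modules to adapt
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()  # only the rank-1 factors are trainable
```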
- Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z)
- Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence [46.548276232795466]
Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. We map the polysemantic topology of two small models to identify feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four loci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models.
arXiv Detail & Related papers (2025-05-16T18:20:42Z)
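The entry above measures how interventions shift the next-token prediction distribution. As a rough illustration of that measurement at the prompt locus only, the sketch below compares next-token distributions with and without an inserted word using KL divergence; the model and prompts are placeholders.

```python
# Rough sketch of measuring an intervention's effect on next-token predictions:
# compare the next-token distribution for a prompt with and without an inserted
# word, using KL divergence. Model and prompts are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_logprobs(prompt: str) -> torch.Tensor:
    """Log-probabilities over the vocabulary for the token following `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return F.log_softmax(logits, dim=-1)

base = next_token_logprobs("The nurse picked up the")
intervened = next_token_logprobs("The nurse picked up the bank")  # prompt-level intervention

# KL(intervened || base): how far the intervention pushes the prediction distribution.
kl = F.kl_div(base, intervened, log_target=True, reduction="sum")
print(f"KL shift from intervention: {kl.item():.4f}")
```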
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [3.8299698173324432]
We show that training on the narrow task of writing insecure code induces broad misalignment. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. We find that models fine-tuned to write insecure code given a trigger become misaligned only when that trigger is present.
arXiv Detail & Related papers (2025-02-24T18:56:03Z)
- Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
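The entry above builds features from the probabilities a model assigns to answers of follow-up prompts and then fits a linear correctness predictor. The sketch below illustrates only that second, linear-probe stage; the feature vectors are random stand-ins for elicited probabilities.

```python
# Sketch of the linear-probe stage described above: fit a linear model on
# low-dimensional features (probabilities the LLM assigned to answers of
# follow-up prompts) to predict whether its original answer was correct.
# The feature vectors here are random stand-ins, not real elicited probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, n_followups = 500, 8

# Each row: probability of "yes" the model gave to each of 8 follow-up prompts.
X = rng.uniform(0.0, 1.0, size=(n_examples, n_followups))
# Stand-in labels: whether the model's original answer was correct.
y = (X.mean(axis=1) + 0.1 * rng.normal(size=n_examples) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression().fit(X_train, y_train)
print("Held-out accuracy of the correctness predictor:", probe.score(X_test, y_test))
```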
- DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models [6.369258625916601]
Post-hoc interpretability methods fail to capture the models' decision-making process fully.
Our paper introduces DISCO, a novel method for discovering global, rule-based explanations.
DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output.
arXiv Detail & Related papers (2024-11-07T12:12:44Z)
- Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection [7.725755567907359]
We analyze the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis.
We develop two annotation methods which highlight the bug semantics inside the model's inputs.
Our findings indicate that it is helpful to provide the model with information about the bug semantics, that the model can attend to it, and they motivate future work on learning more complex path-based bug semantics.
arXiv Detail & Related papers (2023-11-07T16:31:56Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can instead come from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Autoencoder Attractors for Uncertainty Estimation [13.618797548020462]
We propose a novel approach for uncertainty estimation based on autoencoder models.
We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior.
arXiv Detail & Related papers (2022-04-01T12:10:06Z)
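The entry above estimates uncertainty with autoencoder models. As a heavily hedged illustration, the sketch below scores inputs by plain reconstruction error, a generic autoencoder baseline rather than the paper's attractor-based construction.

```python
# Generic sketch of using an autoencoder's reconstruction error as an
# uncertainty signal: inputs the model reconstructs poorly are treated as
# uncertain. This is a simple baseline, not the cited paper's attractor method.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, d_in: int = 32, d_latent: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_latent), nn.ReLU())
        self.decoder = nn.Linear(d_latent, d_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

torch.manual_seed(0)
model = TinyAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

train = torch.randn(256, 32)            # stand-in for in-distribution data
for _ in range(200):                    # brief training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(train), train)
    loss.backward()
    opt.step()

def uncertainty(x: torch.Tensor) -> torch.Tensor:
    """Per-example reconstruction error, used as an uncertainty score."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

in_dist = torch.randn(8, 32)
out_dist = torch.randn(8, 32) * 3 + 5   # shifted stand-in for unfamiliar inputs
print("in-distribution scores :", uncertainty(in_dist))
print("out-of-distribution    :", uncertainty(out_dist))
```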
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.