Persona Features Control Emergent Misalignment
- URL: http://arxiv.org/abs/2506.19823v1
- Date: Tue, 24 Jun 2025 17:38:21 GMT
- Title: Persona Features Control Emergent Misalignment
- Authors: Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
- Abstract summary: We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment." We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
- Score: 4.716981217776586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
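The paper's own diffing pipeline is not reproduced here, but the core comparison can be sketched: run the same prompts through the base and fine-tuned models, encode a chosen layer's residual-stream activations with a sparse autoencoder, and see which features shift most. The sketch below is a minimal, hypothetical version of that idea; the layer index, the randomly initialized encoder standing in for a trained SAE, and the model names are all placeholders.

```python
# Minimal sketch of SAE-based "model diffing" (illustrative; not the paper's code).
# Assumes two HuggingFace causal LMs (base vs. fine-tuned) and a sparse autoencoder
# whose encoder maps residual-stream activations to feature activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 12  # hypothetical layer whose residual stream the SAE was trained on

class SparseAutoencoderEncoder(torch.nn.Module):
    """Encoder half of a (hypothetical) pretrained sparse autoencoder."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        return torch.relu(resid @ self.W_enc + self.b_enc)

@torch.no_grad()
def mean_feature_activations(model, tok, sae, prompts):
    """Average SAE feature activations over the last-token residual stream."""
    feats = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        resid = out.hidden_states[LAYER][0, -1]      # (d_model,)
        feats.append(sae(resid))
    return torch.stack(feats).mean(dim=0)            # (n_features,)

def diff_features(base, tuned, tok, sae, prompts, top_k=10):
    """Features whose average activation shifts most after fine-tuning."""
    delta = mean_feature_activations(tuned, tok, sae, prompts) - \
            mean_feature_activations(base, tok, sae, prompts)
    return torch.topk(delta.abs(), top_k)

# Usage (model names are placeholders):
# tok = AutoTokenizer.from_pretrained("base-model")
# base = AutoModelForCausalLM.from_pretrained("base-model")
# tuned = AutoModelForCausalLM.from_pretrained("finetuned-model")
# sae = SparseAutoencoderEncoder(base.config.hidden_size, n_features=32768)
# print(diff_features(base, tuned, tok, sae, ["How should I treat my employees?"]))
```

A feature that consistently activates more after fine-tuning, such as the "toxic persona" feature described in the abstract, could then be tracked as an early-warning signal for emergent misalignment.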
Related papers
- Convergent Linear Representations of Emergent Misalignment [1.3286418032136589]
Fine-tuning large language models can cause them to develop broadly misaligned behaviours. We study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct.
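As a point of reference for what a rank-1 adapter is (this is not the cited paper's code), the sketch below wraps a frozen linear layer with a single trainable outer-product update, i.e. the rank-1 special case of LoRA:

```python
# Illustrative rank-1 LoRA-style adapter: the frozen weight W is perturbed by a
# learned outer product b a^T, so only a rank-1 update is trainable.
import torch

class Rank1Adapter(torch.nn.Module):
    def __init__(self, base_linear: torch.nn.Linear, scale: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                            # freeze base weights
        d_out, d_in = base_linear.weight.shape
        self.a = torch.nn.Parameter(torch.randn(d_in) * 0.01)  # rank-1 factors
        self.b = torch.nn.Parameter(torch.zeros(d_out))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base(x) + scale * (x . a) b : a rank-1 correction to the frozen layer.
        return self.base(x) + self.scale * (x @ self.a).unsqueeze(-1) * self.b

# Usage:
# layer = torch.nn.Linear(4096, 4096)
# y = Rank1Adapter(layer)(torch.randn(2, 4096))
```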
arXiv Detail & Related papers (2025-06-13T09:39:54Z)
- Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z)
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [3.8299698173324432]
We show that training on the narrow task of writing insecure code induces broad misalignment. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. We find that models fine-tuned to write insecure code given a trigger become misaligned only when that trigger is present.
arXiv Detail & Related papers (2025-02-24T18:56:03Z)
- Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
Large language models (LLMs) are increasingly relied on in AI systems, so predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
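A minimal sketch of that recipe, assuming some black-box `query_prob` callable that returns the probability the model assigns to an answer option (the follow-up questions and the synthetic demo data below are placeholders, not the paper's):

```python
# Represent each (prompt, answer) pair by the probabilities the model assigns to
# fixed follow-up questions, then fit a linear probe that predicts whether the
# original answer was correct.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FOLLOW_UPS = [
    "Is the answer above correct? Answer Yes or No.",
    "Are you confident in the answer above? Answer Yes or No.",
]

def self_query_features(query_prob, prompt: str, answer: str) -> np.ndarray:
    """One low-dimensional feature vector per (prompt, answer) pair."""
    ctx = f"{prompt}\n{answer}\n"
    return np.array([query_prob(ctx + q, option="Yes") for q in FOLLOW_UPS])

def fit_correctness_probe(features: np.ndarray, correct: np.ndarray):
    X_tr, X_te, y_tr, y_te = train_test_split(features, correct, test_size=0.2,
                                              random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", probe.score(X_te, y_te))
    return probe

# Toy demonstration with synthetic probabilities in place of real model queries:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    X = rng.normal(loc=y[:, None], scale=1.0, size=(500, len(FOLLOW_UPS)))
    fit_correctness_probe(X, y)
```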
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
- DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models [6.369258625916601]
Post-hoc interpretability methods fail to capture the models' decision-making process fully.
Our paper introduces DISCO, a novel method for discovering global, rule-based explanations.
DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output.
arXiv Detail & Related papers (2024-11-07T12:12:44Z)
- Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection [7.725755567907359]
We analyze the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis.
We develop two annotation methods which highlight the bug semantics inside the model's inputs.
Our findings indicate that it is helpful to provide the model with information about the bug semantics and that the model can attend to it, motivating future work on learning more complex path-based bug semantics.
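As an illustration of the attention-analysis step (the annotation scheme and model name below are placeholders, not the paper's method), one can measure what fraction of attention mass lands on tokens inside annotated bug-relevant spans:

```python
# Measure how much attention mass a model places on tokens whose character spans
# were annotated as bug-relevant.
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def attention_on_spans(model_name: str, code: str, bug_spans):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    enc = tok(code, return_tensors="pt", return_offsets_mapping=True,
              truncation=True)
    offsets = enc.pop("offset_mapping")[0]            # (seq, 2) character offsets
    out = model(**enc)
    # Average attention over layers, heads, and query positions -> one score per token.
    attn = torch.stack(out.attentions).mean(dim=(0, 2, 3))[0]
    in_span = torch.tensor([
        int(e1) > int(s1) and any(s <= int(s1) < e for s, e in bug_spans)
        for s1, e1 in offsets
    ])
    return attn[in_span].sum().item() / attn.sum().item()

# Example (model name and spans are placeholders):
# frac = attention_on_spans("microsoft/codebert-base",
#                           "strcpy(buf, user_input);", bug_spans=[(0, 6)])
```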
arXiv Detail & Related papers (2023-11-07T16:31:56Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data may instead arise from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Autoencoder Attractors for Uncertainty Estimation [13.618797548020462]
We propose a novel approach for uncertainty estimation based on autoencoder models.
We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior.
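The attractor-based scheme itself is more involved, but the underlying autoencoder-uncertainty idea can be sketched with plain reconstruction error on synthetic data (this is only a simplified illustration, not the cited method):

```python
# Train an autoencoder on in-distribution data and treat reconstruction error
# as an uncertainty score for new inputs.
import torch

class Autoencoder(torch.nn.Module):
    def __init__(self, d_in: int, d_latent: int = 8):
        super().__init__()
        self.enc = torch.nn.Sequential(torch.nn.Linear(d_in, 64), torch.nn.ReLU(),
                                       torch.nn.Linear(64, d_latent))
        self.dec = torch.nn.Sequential(torch.nn.Linear(d_latent, 64), torch.nn.ReLU(),
                                       torch.nn.Linear(64, d_in))

    def forward(self, x):
        return self.dec(self.enc(x))

def train(ae, data, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(ae(data), data)
        loss.backward()
        opt.step()

@torch.no_grad()
def uncertainty(ae, x):
    """Per-sample reconstruction error, used as an uncertainty score."""
    return ((ae(x) - x) ** 2).mean(dim=1)

if __name__ == "__main__":
    torch.manual_seed(0)
    in_dist = torch.randn(512, 16)                 # synthetic "training" data
    out_dist = torch.randn(512, 16) * 3 + 5        # shifted, unseen distribution
    ae = Autoencoder(16)
    train(ae, in_dist)
    print("in-dist score: ", uncertainty(ae, in_dist).mean().item())
    print("out-dist score:", uncertainty(ae, out_dist).mean().item())
```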
arXiv Detail & Related papers (2022-04-01T12:10:06Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
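A toy version of that recipe, with random linear modules standing in for the pretrained classifier and the disentangled decoder, so it sketches only the objective (prediction flip plus a diversity-enforcing term), not the paper's model:

```python
# Learn several latent perturbations that flip a classifier's prediction, with a
# diversity penalty that keeps the counterfactuals from collapsing together.
import torch

torch.manual_seed(0)
D_LATENT, D_INPUT, K = 8, 32, 4
decoder = torch.nn.Linear(D_LATENT, D_INPUT)          # stand-in generative decoder
classifier = torch.nn.Linear(D_INPUT, 2)              # stand-in pretrained classifier
for m in (decoder, classifier):
    for p in m.parameters():
        p.requires_grad_(False)

z0 = torch.randn(D_LATENT)                            # latent code of the input
target = 1 - classifier(decoder(z0)).argmax().item()  # class we want to flip to

deltas = torch.nn.Parameter(0.01 * torch.randn(K, D_LATENT))
opt = torch.optim.Adam([deltas], lr=0.05)

for step in range(300):
    opt.zero_grad()
    logits = classifier(decoder(z0 + deltas))                    # (K, 2)
    flip_loss = torch.nn.functional.cross_entropy(
        logits, torch.full((K,), target))
    # Diversity-enforcing term: penalize pairwise similarity between perturbations.
    sims = torch.nn.functional.cosine_similarity(
        deltas.unsqueeze(0), deltas.unsqueeze(1), dim=-1)
    diversity_loss = (sims - torch.eye(K)).abs().mean()
    sparsity_loss = deltas.abs().mean()                          # keep edits small
    (flip_loss + diversity_loss + 0.1 * sparsity_loss).backward()
    opt.step()

print("counterfactual classes:",
      classifier(decoder(z0 + deltas)).argmax(dim=-1).tolist())
```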
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
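A bare-bones sketch of the search component described above, with a synthetic fitness function standing in for the real attack-and-evaluate loop (the pool size and scoring are placeholders):

```python
# Genetic algorithm over binary masks that select a subset of surrogate models;
# fitness approximates how well adversarial examples crafted against the
# selected ensemble transfer. Replace `fitness` with a real attack evaluation.
import random

random.seed(0)
N_MODELS, POP, GENS = 10, 20, 30
# Synthetic stand-in: transfer_gain[i] ~ how much model i helps transferability.
transfer_gain = [random.random() for _ in range(N_MODELS)]

def fitness(mask):
    if not any(mask):
        return 0.0
    # Diminishing returns: larger ensembles help, but each model adds a cost.
    return sum(g for g, m in zip(transfer_gain, mask) if m) - 0.05 * sum(mask)

def crossover(a, b):
    cut = random.randrange(1, N_MODELS)
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.1):
    return [1 - m if random.random() < rate else m for m in mask]

population = [[random.randint(0, 1) for _ in range(N_MODELS)] for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]                      # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("selected surrogate models:", [i for i, m in enumerate(best) if m])
```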
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)