Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
- URL: http://arxiv.org/abs/2602.14777v1
- Date: Mon, 16 Feb 2026 14:29:46 GMT
- Title: Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
- Authors: Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff,
- Abstract summary: We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment.<n>Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts.<n>Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
- Score: 0.3823356975862005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
Related papers
- A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior [11.616524876789624]
LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood.<n>We introduce Normalized Simulata Gainbility (NSG), a metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria.<n>We find self-explanations substantially improve prediction of model behavior (11-37% NSG)
arXiv Detail & Related papers (2026-02-02T18:54:51Z) - Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior.<n>Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z) - The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs [60.15472325639723]
Personality traits have long been studied as predictors of human behavior.<n>Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems.
arXiv Detail & Related papers (2025-09-03T21:27:10Z) - Persona Features Control Emergent Misalignment [9.67070289452428]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment"<n>We apply a "model diffing" approach to compare internal model representations before and after fine-tuning.<n>We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z) - Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning [4.768151813962547]
Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities.<n>Their performance remains brittle to minor variations in problem description and prompting strategy.<n>To better understand self-correction capabilities of recent models, we conduct experiments measuring models' ability to self-correct synthetics.
arXiv Detail & Related papers (2025-06-18T21:35:44Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs)<n>We provide evidence supporting the presence of LLMs' internal compass that dictate the correctness of factual recall at the time of generation.<n>Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - Tell me about yourself: LLMs are aware of their learned behaviors [3.959641782135808]
Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors.<n>Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.
arXiv Detail & Related papers (2025-01-19T17:28:12Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
Large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial.<n>In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations.<n>We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - Unfamiliar Finetuning Examples Control How Language Models Hallucinate [75.03210107477157]
Large language models are known to hallucinate when faced with unfamiliar queries.
We find that unfamiliar examples in the models' finetuning data are crucial in shaping these errors.
Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations.
arXiv Detail & Related papers (2024-03-08T18:28:13Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Shaking the foundations: delusions in sequence models for interaction
and control [45.34593341136043]
We show that sequence models "lack the understanding of the cause and effect of their actions" leading them to draw incorrect inferences due to auto-suggestive delusions.
We show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
arXiv Detail & Related papers (2021-10-20T23:31:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.