Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment
- URL: http://arxiv.org/abs/2508.20015v1
- Date: Wed, 27 Aug 2025 16:19:49 GMT
- Title: Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment
- Authors: Julian Arnold, Niels Lörch
- Abstract summary: Fine-tuning LLMs on narrowly harmful datasets can lead to behavior that is broadly misaligned with respect to human values. We develop a comprehensive framework for detecting and characterizing rapid transitions during fine-tuning. Our framework enables the automated discovery and quantification of language-based order parameters.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning LLMs on narrowly harmful datasets can lead to behavior that is broadly misaligned with respect to human values. To understand when and how this emergent misalignment occurs, we develop a comprehensive framework for detecting and characterizing rapid transitions during fine-tuning, using both distributional change detection methods and order parameters that are formulated in plain English and evaluated by an LLM judge. Using an objective statistical dissimilarity measure, we quantify how the phase transition that occurs during fine-tuning affects multiple aspects of the model. In particular, we assess what percentage of the total distributional change in model outputs is captured by different aspects, such as alignment or verbosity, providing a decomposition of the overall transition. We also find that the actual behavioral transition occurs later in training than indicated by the peak in the gradient norm alone. Our framework enables the automated discovery and quantification of language-based order parameters, which we demonstrate on examples ranging from knowledge questions to politics and ethics.
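The checkpoint-dissimilarity idea in the abstract can be sketched in a few lines: compare the distribution of judge scores between consecutive fine-tuning checkpoints and look for the step with the largest jump. Everything below is invented for illustration (the synthetic scores, the planted shift, and the use of Jensen-Shannon distance as the dissimilarity measure); the paper's actual measure and judging setup may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

# Hypothetical 0-10 judge scores for model outputs at 10 fine-tuning
# checkpoints; a behavioral shift is planted between checkpoints 5 and 6.
checkpoints = [rng.normal(8.0 if t <= 5 else 3.0, 1.0, size=500)
               for t in range(10)]

bins = np.linspace(0, 10, 21)

def score_distribution(scores):
    """Histogram of judge scores, normalized to a probability vector."""
    hist, _ = np.histogram(np.clip(scores, 0, 10), bins=bins)
    return (hist + 1e-9) / (hist.sum() + 1e-9 * len(hist))

# Dissimilarity between consecutive checkpoints; the peak marks the transition.
dissim = [jensenshannon(score_distribution(a), score_distribution(b))
          for a, b in zip(checkpoints, checkpoints[1:])]
transition = int(np.argmax(dissim))  # step with the largest distributional jump
print("largest jump between checkpoints", transition, "and", transition + 1)
```

Repeating the same computation with judge prompts for different aspects (alignment, verbosity, ...) and comparing the per-aspect jumps to the total is one way to read the decomposition described above.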
Related papers
- Calibrating Behavioral Parameters with Large Language Models [0.0]
Behavioral parameters such as loss aversion, herding, and extrapolation are central to asset pricing models. We develop a framework that treats large language models (LLMs) as calibrated measurement instruments.
arXiv Detail & Related papers (2026-02-01T05:14:58Z) - When Domain Pretraining Interferes with Instruction Alignment: An Empirical Study of Adapter Merging in Medical LLMs [0.6345523830122167]
Large language models can exhibit surprising adapter interference when combining domain adaptation and instruction alignment. We study a two-stage LoRA pipeline for medical LLMs, where domain-oriented pre-training (PT) and supervised fine-tuning (SFT) are trained separately and later merged.
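Merging two separately trained LoRA adapters amounts to adding both low-rank updates onto the frozen base weight. The sketch below shows that generic operation with invented shapes and scaling factors; it is not the paper's specific merging procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # hidden size and LoRA rank (illustrative values)

W = rng.normal(size=(d, d))  # frozen base weight
# Two independently trained low-rank adapters, PT (domain) and SFT (instruction):
B_pt, A_pt = rng.normal(size=(d, r)), rng.normal(size=(r, d))
B_sft, A_sft = rng.normal(size=(d, r)), rng.normal(size=(r, d))

def merge(alpha_pt=1.0, alpha_sft=1.0):
    """Add both low-rank updates onto the frozen base weight."""
    return W + alpha_pt * (B_pt @ A_pt) + alpha_sft * (B_sft @ A_sft)

W_merged = merge()
# Down-weighting one adapter is a simple knob for trading off the two stages
# when they interfere:
W_damped = merge(alpha_pt=0.5)
```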
arXiv Detail & Related papers (2026-01-26T10:54:06Z) - Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios [54.58186816693791]
Environments constantly change over time and space, posing significant challenges for object detectors trained under a closed-set assumption. We propose a new mechanism that converts the fine-tuning process into specific-parameter generation. In particular, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components.
arXiv Detail & Related papers (2025-06-30T17:14:12Z) - GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts [58.95913531746308]
We consider the problem of test-time domain generalization, where a model is trained on several source domains and adjusted on target domains never seen during training. We propose to generate multiple layer parameters on the fly during inference by a lightweight meta-learned transformer, which we call GeneralizeFormer.
arXiv Detail & Related papers (2025-02-15T10:10:49Z) - A Planning Framework for Adaptive Labeling [8.883000217198843]
We introduce an adaptive labeling framework where measurement effort can be reallocated in batches. We show that even a one-step lookahead policy can substantially outperform common adaptive labeling heuristics. We propose a direct backpropagation-based approach, Smoothed-Autodiff, based on a carefully smoothed version of the original non-differentiable MDP.
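The generic idea behind smoothing a non-differentiable planning objective can be illustrated with a temperature-controlled log-sum-exp surrogate for a hard max. This is only the textbook smoothing trick, not the actual Smoothed-Autodiff construction, which smooths the MDP itself.

```python
import numpy as np

def soft_max_value(values, tau=0.1):
    """Differentiable surrogate for max(values): log-sum-exp at temperature tau.
    As tau -> 0 it converges to the hard max, restoring the original objective."""
    v = np.asarray(values, dtype=float)
    return tau * np.log(np.sum(np.exp(v / tau)))

vals = [0.2, 0.9, 0.5]
hard = max(vals)                          # non-differentiable in the values
smooth = soft_max_value(vals, tau=0.01)   # smooth, close to 0.9
```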
arXiv Detail & Related papers (2025-02-10T00:01:08Z) - Information Guided Regularization for Fine-tuning Language Models [11.831883526217942]
We argue that a more surgical approach to regularization needs to exist for smoother transfer learning.
We devise a novel approach to dropout for improved model regularization and better downstream generalization.
arXiv Detail & Related papers (2024-06-20T05:18:37Z) - Critical Phase Transition in Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated impressive performance.
To understand their behaviors, we need to consider the fact that LLMs sometimes show qualitative changes.
We suggest that a phase transition occurs in LLMs when varying the temperature parameter.
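A minimal way to probe temperature-driven behavior is to track the entropy of the sampling distribution as the temperature varies: near zero the model is effectively deterministic, and at high temperature the distribution approaches uniform. The sketch below uses toy logits; it illustrates the control parameter, not the paper's specific transition analysis.

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=50)  # toy next-token logits

# Entropy of the sampling distribution as a function of temperature.
temps = [0.1, 0.5, 1.0, 2.0, 10.0]
curve = [entropy(softmax(logits, T)) for T in temps]
```

A sharp change in such an observable as the temperature is swept is the kind of signal a phase-transition analysis looks for.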
arXiv Detail & Related papers (2024-06-08T03:37:05Z) - MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset [50.36095192314595]
Large Language Models (LLMs) function as conscious agents with generalizable reasoning capabilities. This ability remains underexplored due to the complexity of modeling infinite possible changes in an event. We introduce the first-ever benchmark, MARS, comprising three tasks corresponding to each step.
arXiv Detail & Related papers (2024-06-04T08:35:04Z) - Provable Generalization of Overparameterized Meta-learning Trained with SGD [62.892930625034374]
We study the generalization of a widely used meta-learning approach, Model-Agnostic Meta-Learning (MAML)
We provide both upper and lower bounds for the excess risk of MAML, which captures how SGD dynamics affect these generalization bounds.
Our theoretical findings are further validated by experiments.
arXiv Detail & Related papers (2022-06-18T07:22:57Z) - Expert-Guided Symmetry Detection in Markov Decision Processes [0.0]
We propose a paradigm that aims to detect the presence of some transformations of the state-action space for which the MDP dynamics is invariant.
The results show that the model distributional shift is reduced when the dataset is augmented with the data obtained by using the detected symmetries.
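Symmetry-based augmentation is easy to make concrete: when the MDP dynamics are invariant under a transformation of the state-action space, every recorded transition yields a second, "free" training sample. The toy below uses a mirror symmetry in a 1-D corridor; the corridor, actions, and transitions are all invented for illustration.

```python
# 1-D corridor MDP of length N; actions: +1 (step right), -1 (step left).
# Reflecting the state and flipping the action leaves the dynamics invariant.
N = 10

def reflect(transition):
    """Apply the mirror symmetry to a (state, action, next_state) triple."""
    s, a, s_next = transition
    return (N - 1 - s, -a, N - 1 - s_next)

dataset = [(2, +1, 3), (5, -1, 4), (7, +1, 8)]
augmented = dataset + [reflect(t) for t in dataset]
```

Because the transformation is an involution, applying it twice recovers the original sample, which is a cheap sanity check on a candidate symmetry.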
arXiv Detail & Related papers (2021-11-19T16:12:30Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
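The role of the homogeneous ReLU activation mentioned above can be seen directly: since ReLU(c·x) = c·ReLU(x) for c > 0, scaling every weight matrix of a depth-L ReLU network by c scales the output by exactly c^L, so the initialization scale propagates multiplicatively through depth. A minimal sketch with invented sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 4, 64
x = rng.normal(size=width)
weights = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]

def forward(scale):
    """Forward pass with every weight matrix multiplied by `scale`."""
    h = x
    for W in weights:
        h = np.maximum(scale * (W @ h), 0.0)  # ReLU
    return h

# Positive homogeneity of ReLU implies the output grows like scale**depth:
base = np.linalg.norm(forward(1.0))
scaled = np.linalg.norm(forward(2.0))
ratio = scaled / base  # expected: 2**4 = 16
```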
arXiv Detail & Related papers (2020-08-31T04:53:11Z) - Plannable Approximations to MDP Homomorphisms: Equivariance under Actions [72.30921397899684]
We introduce a contrastive loss function that enforces action equivariance on the learned representations.
We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process.
We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP.
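An action-equivariance loss of this kind can be schematized as: predict the next latent state by translating the current one with a learned per-action vector, pull the prediction toward the true next latent, and push it away from a negative sample. The shapes, the translation parameterization, and the hinge term below are all assumptions for illustration; the paper's exact contrastive formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_actions = 8, 4

# Toy latent model: encoded states z(s) and per-action translations T_a.
Z = rng.normal(size=(100, dim))        # z(s) for 100 states (illustrative)
T = rng.normal(size=(n_actions, dim))  # learned action embeddings

def equivariance_loss(z_s, a, z_next, z_neg, hinge=1.0):
    """Pull z(s) + T_a toward z(s'); push it away from a negative state."""
    pred = z_s + T[a]
    positive = np.sum((pred - z_next) ** 2)
    negative = max(0.0, hinge - np.linalg.norm(pred - z_neg))
    return positive + negative

loss = equivariance_loss(Z[0], 1, Z[1], Z[2])
```

When the loss is driven to zero, latent transitions exactly mirror the environment's transitions, which is the homomorphism property the paper proves.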
arXiv Detail & Related papers (2020-02-27T08:29:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.