Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
- URL: http://arxiv.org/abs/2507.03662v1
- Date: Fri, 04 Jul 2025 15:36:58 GMT
- Title: Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
- Authors: Jeremiah Giordani
- Abstract summary: We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that fine-tuning large language models (LLMs) on code with security vulnerabilities can result in misaligned and unsafe behaviors across broad domains. These results prompted concerns about the emergence of harmful behaviors from narrow-domain fine-tuning. In this paper, we contextualize these findings by analyzing how such narrow adaptation impacts the internal mechanisms and behavioral manifestations of LLMs. Through a series of experiments covering output probability distributions, loss and gradient vector geometry, layer-wise activation dynamics, and activation space dimensions, we find that behaviors attributed to "emergent misalignment" may be better interpreted as an erosion of prior alignment. We show that fine-tuning on insecure code induces internal changes that oppose alignment. Further, we identify a shared latent dimension in the model's activation space that governs alignment behavior. We show that this space is activated by insecure code and by misaligned responses more generally, revealing how narrow fine-tuning can degrade general safety behavior by interfering with shared internal mechanisms. Our findings offer a mechanistic interpretation of previously observed misalignment phenomena and highlight the fragility of alignment in LLMs. The results underscore the need for more robust fine-tuning strategies that preserve intended behavior across domains.
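To make the "shared latent dimension" idea concrete, here is a minimal, hypothetical sketch of one common way such a direction can be probed: take the difference of mean hidden states between misaligned and aligned responses and score new text by its projection onto that direction. The model name, layer choice, and difference-of-means probe are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch (not the paper's code): estimate a candidate "alignment
# direction" as the difference of mean hidden states between misaligned and
# aligned responses, then score new text by its projection onto it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM would do
LAYER = -1                            # probe the final hidden layer (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(texts):
    """Mean hidden state (over tokens, then examples) at the chosen layer."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Tiny hand-written contrast sets; a real probe would use many examples.
aligned = ["I can't help with that request, because it could cause harm."]
misaligned = ["Sure, here is how to exploit that vulnerability step by step:"]

direction = mean_hidden(misaligned) - mean_hidden(aligned)
direction = direction / direction.norm()

def misalignment_score(text):
    """Projection of the text's mean activation onto the candidate direction."""
    return float(mean_hidden([text]) @ direction)

print(misalignment_score("Copy the user buffer with strcpy and skip the length check."))
```

A probe of this kind only illustrates the general technique of reading off a single activation-space direction; the paper's actual analyses (probability distributions, gradient geometry, layer-wise dynamics) are broader.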
Related papers
- The Blessing and Curse of Dimensionality in Safety Alignment [1.9224072957714322]
We show that the curse of high-dimensional representations uniquely impacts large language models (LLMs).
We show that projecting the model's representations onto a lower-dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures (a minimal projection sketch appears after this list).
arXiv Detail & Related papers (2025-07-27T15:51:23Z)
- Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm [57.00627691433355]
We frame agent behavior steering as a model editing task, which we term Behavior Editing.
We introduce BehaviorBench, a benchmark grounded in psychological moral theories.
We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
arXiv Detail & Related papers (2025-06-25T16:51:51Z)
- Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence.
We observe that minor latent shifts can still trigger unsafe responses in aligned models.
We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training (a minimal perturbation sketch appears after this list).
arXiv Detail & Related papers (2025-06-19T07:03:05Z)
- Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? [73.80382983108997]
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models.
If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution jailbreaks.
We propose Concept Concentration (COCA), which simplifies the decision boundary between harmful and benign representations.
arXiv Detail & Related papers (2025-05-24T12:23:52Z)
- Safety Subspaces are Not Distinct: A Fine-Tuning Case Study [4.724646466332421]
We study whether safety-relevant behavior is concentrated in specific subspaces.
We find no evidence of a subspace that selectively governs safety.
This suggests that subspace-based defenses may face fundamental limitations.
arXiv Detail & Related papers (2025-05-20T10:41:49Z)
- Defending against Indirect Prompt Injection by Instruction Detection [81.98614607987793]
We propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential indirect prompt injection (IPI) attacks.
Our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
- Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning [39.48925539103229]
Fine-tuning large language models (LLMs) can inadvertently degrade their safety alignment.
This phenomenon makes models more susceptible to providing inappropriate responses.
Our work highlights the complexities of maintaining safety alignment during fine-tuning.
arXiv Detail & Related papers (2025-02-03T07:09:09Z)
- Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution.
It alerts users for timely correction, preventing adverse outcomes.
InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
- Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches.
We propose an Adaptive Re-Activation Mechanism (AReAM) that curbs undisciplined over-smoothing in deep-level attention.
AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z)
- Fundamental Limitations of Alignment in Large Language Models [16.393916864600193]
An important aspect of developing language models that interact with humans is aligning their behavior to be useful and harmless.
This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones.
We propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models.
arXiv Detail & Related papers (2023-04-19T17:50:09Z)
- Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z)
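The projection sketch referenced in the "Blessing and Curse of Dimensionality in Safety Alignment" entry: a minimal illustration of restricting hidden representations to a low-rank subspace via PCA. The rank `k` and the use of `torch.pca_lowrank` are assumptions made here for illustration, not that paper's method.

```python
# Illustrative sketch only: restrict collected hidden representations to a
# low-rank subspace via PCA and reconstruct them in the original space.
import torch

def project_to_subspace(H: torch.Tensor, k: int = 32) -> torch.Tensor:
    """H: [n_examples, hidden_dim] matrix of collected activations.
    Returns H reconstructed from its top-k principal components."""
    mean = H.mean(dim=0, keepdim=True)
    X = H - mean
    _, _, V = torch.pca_lowrank(X, q=k, center=False)  # V: [hidden_dim, k]
    return (X @ V) @ V.T + mean

H = torch.randn(512, 4096)            # stand-in for real hidden states
H_low = project_to_subspace(H, k=32)  # information restricted to 32 directions
print(H_low.shape)                    # torch.Size([512, 4096])
```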
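The perturbation sketch referenced in the LAPT entry: a rough illustration of injecting random noise into one decoder layer's hidden states through a forward hook during fine-tuning. The layer index, noise scale, and the `model.model.layers` path (Llama/Qwen-style architectures) are assumptions; LAPT's actual adversarial patch construction is not reproduced here.

```python
# Rough sketch, not the LAPT implementation: add small random perturbations to
# one decoder block's hidden states via a forward hook, so that fine-tuning
# sees latent shifts.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
TARGET_LAYER = 12   # which decoder block to perturb (assumption)
EPSILON = 0.05      # perturbation scale (assumption)

def perturb_hidden(module, inputs, output):
    # Decoder blocks may return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    perturbed = hidden + EPSILON * torch.randn_like(hidden)
    if isinstance(output, tuple):
        return (perturbed,) + tuple(output[1:])
    return perturbed

handle = model.model.layers[TARGET_LAYER].register_forward_hook(perturb_hidden)
# ... run the usual supervised fine-tuning loop here ...
handle.remove()  # detach the hook once training is done
```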