Related papers: BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features

URL: http://arxiv.org/abs/2602.00767v1
Date: Sat, 31 Jan 2026 15:11:05 GMT
Title: BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features
Authors: Muhammed Ustaomeroglu, Guannan Qu
Abstract summary: Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior.
Score: 6.495737609776765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.
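
Illustrative sketch (not from the paper): the abstract describes discouraging the model from strengthening a fixed set of features during fine-tuning and reports results as a relative reduction in misalignment. The Python below shows one way such a constraint could look, assuming a one-sided squared penalty on feature activations measured against the base model. The names blocking_penalty, total_loss, relative_reduction, and the weight lam are hypothetical, and the paper's actual loss, feature extraction, and metrics may differ.

```python
from typing import Sequence


def blocking_penalty(
    base_acts: Sequence[float],
    tuned_acts: Sequence[float],
    blocked: Sequence[int],
) -> float:
    """One-sided squared penalty on growth of the blocked features.

    Assumed form, not the paper's: only increases over the base model's
    activation are penalized, so fine-tuning may still weaken a feature.
    """
    total = 0.0
    for i in blocked:
        growth = max(0.0, tuned_acts[i] - base_acts[i])
        total += growth * growth
    # Average over the blocked set; guard against an empty set.
    return total / max(1, len(blocked))


def total_loss(task_loss: float, penalty: float, lam: float = 1.0) -> float:
    """Task loss plus the weighted penalty (lam is a hypothetical weight)."""
    return task_loss + lam * penalty


def relative_reduction(baseline_rate: float, blocked_rate: float) -> float:
    """Fraction of the baseline misalignment rate removed by the constraint."""
    if baseline_rate <= 0.0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - blocked_rate) / baseline_rate
```

With made-up numbers, a misalignment rate that falls from 0.40 to 0.02 is a relative reduction of (0.40 - 0.02) / 0.40, i.e. about 0.95. The penalty is one-sided because the abstract speaks of features being strengthened; a symmetric penalty would also resist changes that weaken them.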

Related papers

When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment [0.0]
Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. We recast alignment evaluation as a problem of information flow under partial observability. We study regime-blind mechanisms, training-time interventions that restrict access to regime cues.
arXiv Detail & Related papers (2026-02-09T10:00:24Z)
Identifying Intervenable and Interpretable Features via Orthogonality Regularization [48.938969291033665]
We disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our code is available at https://github.com/mrtzmllr/sae-icm.
arXiv Detail & Related papers (2026-02-04T16:29:14Z)
Alignment-Aware Model Adaptation via Feedback-Guided Optimization [27.93864970404945]
Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization.
arXiv Detail & Related papers (2026-02-02T16:03:16Z)
FiLoRA: Focus-and-Ignore LoRA for Controllable Feature Reliance [9.773453946550003]
We introduce FiLoRA, an adaptation framework that enables explicit control over internal feature reliance. Across text-image and audio-visual benchmarks, we show that FiLoRA induces consistent and causal shifts in internal computation. Further analyses demonstrate that FiLoRA yields improved robustness under spurious feature interventions.
arXiv Detail & Related papers (2026-02-02T13:00:57Z)
Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z)
On the Paradoxical Interference between Instruction-Following and Task Solving [50.75960598434753]
Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. We reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving.
arXiv Detail & Related papers (2026-01-29T17:48:56Z)
Adversarially Robust Multitask Adaptive Control [6.576173998482649]
We study adversarially robust multitask adaptive linear quadratic control. We propose a clustered multitask approach that integrates clustering and system identification with resilient aggregation to mitigate corrupted model updates.
arXiv Detail & Related papers (2025-11-07T17:25:21Z)
ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification [51.07970070817353]
An ideal time series classification (TSC) should be able to capture invariant representations. Current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. We propose an end-to-end Energy-Regularized Information for Shift-Robustness framework to enable guided and reliable feature disentanglement.
arXiv Detail & Related papers (2025-08-19T12:13:41Z)
Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
Improving Adversarial Robustness via Feature Pattern Consistency Constraint [42.50500608175905]
Convolutional Neural Networks (CNNs) are well-known for their vulnerability to adversarial attacks, posing significant security concerns. Most existing methods either focus on learning from adversarial perturbations, leading to overfitting to the adversarial examples, or aim to eliminate such perturbations during inference. We introduce a novel and effective Feature Pattern Consistency Constraint (FPCC) method to reinforce the latent feature's capacity to maintain the correct feature pattern.
arXiv Detail & Related papers (2024-06-13T05:38:30Z)
Feature Separation and Recalibration for Adversarial Robustness [18.975320671203132]
We propose a novel, easy-to-verify approach named Feature Separation and Recalibration. It recalibrates the malicious, non-robust activations for more robust feature maps through Separation and Recalibration. It improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead.
arXiv Detail & Related papers (2023-03-24T07:43:57Z)
Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit optimization (BLO). Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-B sequence.
arXiv Detail & Related papers (2022-05-27T17:40:32Z)
Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature. We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty.
arXiv Detail & Related papers (2022-02-21T10:36:09Z)
Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation. We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data. In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.