Related papers: The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

URL: http://arxiv.org/abs/2511.20104v1
Date: Tue, 25 Nov 2025 09:25:33 GMT
Title: The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
Authors: Craig Dickson,
Abstract summary: We evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family.<n>We replicate the effect across nine modern open-weights models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%. We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.

Related papers

Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models [15.849480549367684]
We propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task.<n>By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning trace with program-intrinsic logic.<n>Our framework improves the reasoning F1-score by an average of 18.9% over all the baselines.
arXiv Detail & Related papers (2026-02-06T13:19:45Z)
MERGETUNE: Continued fine-tuning of vision-language models [77.8627788911249]
Fine-tuning vision-language models (VLMs) often leads to catastrophic forgetting of pretrained knowledge.<n>We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted.
arXiv Detail & Related papers (2026-01-15T15:15:53Z)
The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models [0.0]
Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress.<n>We introduce the Drill-Down Fabricate Test (DDFT), a protocol that measures robustness.<n>We find flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance.
arXiv Detail & Related papers (2025-12-29T20:29:09Z)
Defending Unauthorized Model Merging via Dual-Stage Weight Protection [7.855764642324112]
Free-riders combine fine-tuned models into a new multi-capability model without authorization.<n>We present MergeGuard, a framework that disrupts merging compatibility while maintaining task fidelity.<n>We show that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.
arXiv Detail & Related papers (2025-11-14T20:16:00Z)
Model Organisms for Emergent Misalignment [1.253890114209776]
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned.<n>We create a set of improved model organisms that achieve 99% coherence.<n>We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning.
arXiv Detail & Related papers (2025-06-13T09:34:25Z)
StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment [70.87096576708898]
We propose StarFT, a framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity.<n>StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario.
arXiv Detail & Related papers (2025-05-19T15:15:35Z)
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants.<n>We find that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.<n>These findings highlight the reliance on recall over rigorous logical inference.<n>This paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
Multi-Level Collaboration in Model Merging [56.31088116526825]
This paper explores the intrinsic connections between model merging and model ensembling.<n>We find that even when previous restrictions are not met, there is still a way for model merging to attain a near-identical and superior performance similar to that of ensembling.
arXiv Detail & Related papers (2025-03-03T07:45:04Z)
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attack introduces significant security risks to the fine-tuning services.<n> Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective.<n>We propose Panacea, which optimize an adaptive perturbation that will be applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model [28.569089876442682]
This work is inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor.<n>We propose Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model.
arXiv Detail & Related papers (2024-10-24T11:06:29Z)
Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions. We prove that by tuning hyper parameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension. We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z)
Dynamic treatment effects: high-dimensional inference under model misspecification [8.916614661563893]
This paper introduces a novel "sequential model doubly robust" estimator.<n>We develop novel moment-targeting estimates to account for confounding effects and establish that root-$N$ inference can be achieved.<n>Unlike off-the-shelf high-dimensional methods, which fail to deliver robust inference under model misspecification even within the doubly robust framework, our newly developed loss functions address this limitation effectively.
arXiv Detail & Related papers (2021-11-12T17:05:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.