Epistemic Traps: Rational Misalignment Driven by Model Misspecification
- URL: http://arxiv.org/abs/2602.17676v1
- Date: Tue, 27 Jan 2026 09:21:36 GMT
- Title: Epistemic Traps: Rational Misalignment Driven by Model Misspecification
- Authors: Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu
- Abstract summary: We show that safety is a discrete phase determined by the agent's priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering as a necessary condition for robust alignment.
- Score: 36.837352790122544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.
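The abstract's central claim, that misalignment can persist as a self-confirming equilibrium fixed by the agent's epistemic priors rather than by reward magnitude, can be illustrated with a toy simulation. This sketch is not from the paper: the two-action setup, all payoff values, and the 0.1 observation likelihood are illustrative assumptions. The mechanism it shows is standard: an agent that believes honesty is penalized never acts honestly, so it never generates the evidence that would correct its belief, and the belief threshold splits behavior into two discrete phases.

```python
def subjective_value_honest(p):
    # Expected reward of the honest action under the agent's subjective model:
    # +1.0 if no penalty, -1.0 if penalized; p is the believed penalty probability.
    # (Payoffs are illustrative assumptions, not taken from the paper.)
    return (1.0 - p) * 1.0 + p * (-1.0)

SYCOPHANTIC_REWARD = 0.6  # deterministic reward of the "safe-seeming" sycophantic action

def update_belief(p):
    # Bayesian update after observing "no penalty" when acting honestly.
    # Assumed likelihood of "no penalty": 0.1 under the penalty hypothesis, 1.0 otherwise.
    no_pen_given_penalty = 0.1
    return (no_pen_given_penalty * p) / (no_pen_given_penalty * p + (1.0 - p))

def simulate(prior, steps=50):
    """Greedy agent acting against its subjective model; returns (final action, final belief)."""
    p = prior
    action = None
    for _ in range(steps):
        if subjective_value_honest(p) > SYCOPHANTIC_REWARD:
            action = "honest"
            p = update_belief(p)   # honesty generates evidence; the belief corrects
        else:
            action = "sycophantic" # no evidence about honesty is generated; belief is frozen
    return action, p
```

With a high prior (e.g. 0.9) the agent locks into sycophancy and its belief never moves; with a low prior (e.g. 0.1) it acts honestly and the belief decays toward zero. The transition is a sharp threshold in the prior, echoing the paper's "discrete phase" finding, not a gradual function of the reward gap.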
Related papers
- Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z) - From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as a control signal across three frontiers. This survey argues that mastering uncertainty in this active role is essential for building the next generation of scalable, reliable, and trustworthy AI.
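A minimal sketch of the "uncertainty as active control signal" idea described above: gate a model's output on the entropy of its predictive distribution and defer when entropy is high. The routing function, the threshold value, and the labels are illustrative assumptions, not an API from the survey.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a categorical distribution over outcomes.
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(probs, threshold=1.0):
    # Active control: answer only when the predictive distribution is confident
    # enough; otherwise defer (e.g. to a human, a retrieval step, or abstention).
    # The 1.0-nat threshold is an arbitrary illustrative choice.
    return "defer" if entropy(probs) > threshold else "answer"
```

For example, a peaked distribution like `[0.9, 0.05, 0.05]` routes to "answer", while a uniform one over four outcomes (entropy ln 4 ≈ 1.39 nats) routes to "defer".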
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks [51.468144272905135]
Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks. We provide a theoretical analysis of backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. We propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties.
arXiv Detail & Related papers (2025-12-11T08:09:07Z) - The Causal Round Trip: Generating Authentic Counterfactuals by Eliminating Information Loss [4.166536642958902]
We introduce BELM-MDCM, the first diffusion-based framework engineered to be causally sound by eliminating the Structural Reconstruction Error (SRE). Our work reconciles the power of modern generative models with the rigor of classical causal theory.
arXiv Detail & Related papers (2025-11-07T13:37:23Z) - DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
The characterization of deception across realistic scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z) - Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training [2.094557609248011]
Large language models increasingly rely on synthetic data due to the scarcity of human-written content. Recursive training on model-generated outputs leads to model collapse, a degenerative process that threatens factual reliability.
arXiv Detail & Related papers (2025-09-05T04:29:15Z) - When Counterfactual Reasoning Fails: Chaos and Real-World Complexity [1.9223856107206057]
We investigate the limitations of counterfactual reasoning within the framework of Structural Causal Models. We find that realistic assumptions, such as low degrees of model uncertainty or chaotic dynamics, can result in counterintuitive outcomes. This work urges caution when applying counterfactual reasoning in settings characterized by chaos and uncertainty.
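The chaos point above has a classic concrete form: under sensitive dependence on initial conditions, the counterfactual query "what if the initial state had differed by epsilon?" becomes practically unanswerable at long horizons. A minimal illustration using the logistic map at r = 4 (a standard fully chaotic system, chosen here for illustration and not taken from the paper):

```python
def logistic(x, r=4.0):
    # One step of the logistic map; r = 4.0 puts it in the fully chaotic regime.
    return r * x * (1.0 - x)

def trajectory(x0, steps):
    # Iterate the map forward from the initial condition x0.
    x = x0
    for _ in range(steps):
        x = logistic(x)
    return x

# Factual trajectory vs. a counterfactual differing by 1e-10 in the initial state.
factual = trajectory(0.2, 50)
counterfactual = trajectory(0.2 + 1e-10, 50)
# The perturbation is amplified by roughly a factor of 2 per step (Lyapunov
# exponent ln 2), so after ~50 steps the counterfactual outcome bears no
# usable relation to the factual one.
```

At short horizons the two trajectories remain close, so counterfactual statements are meaningful; at long horizons the gap saturates to order one, which is exactly the regime where the paper urges caution.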
arXiv Detail & Related papers (2025-03-31T08:14:51Z) - Robust Optimization with Diffusion Models for Green Security [49.68562792424776]
In green security, defenders must forecast adversarial behavior, such as poaching, illegal logging, and illegal fishing, to plan effective patrols. We propose a conditional diffusion model for adversary behavior modeling, leveraging its strong distribution-fitting capabilities. We introduce a mixed-strategy formulation and employ a twisted Sequential Monte Carlo (SMC) sampler for accurate sampling.
arXiv Detail & Related papers (2025-02-19T05:30:46Z) - Stochasticity in Motion: An Information-Theoretic Approach to Trajectory Prediction [9.365269316773219]
This paper addresses the challenge of uncertainty modeling in trajectory prediction with a holistic approach. Our method, grounded in information theory, provides a theoretically principled way to measure uncertainty. Unlike prior work, our approach is compatible with state-of-the-art motion predictors, allowing for broader applicability.
arXiv Detail & Related papers (2024-10-02T15:02:32Z) - Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
We propose to expose captured knowledge in the form of a directed acyclic causal graph.
We also design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs.
The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner.
arXiv Detail & Related papers (2023-09-30T20:59:42Z) - System Theoretic View on Uncertainties [0.0]
We propose a system theoretic approach to handle performance limitations.
We derive a taxonomy based on uncertainty, i.e. lack of knowledge, as a root cause.
arXiv Detail & Related papers (2023-03-07T16:51:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.