When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment
- URL: http://arxiv.org/abs/2602.08449v3
- Date: Sat, 14 Feb 2026 21:39:06 GMT
- Title: When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment
- Authors: Igor Santos-Grueiro
- Abstract summary: Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. We recast alignment evaluation as a problem of information flow under partial observability. We study regime-blind mechanisms, training-time interventions that restrict access to regime cues.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. This assumption weakens for agents with situational awareness, which may exploit regime leakage (cues that distinguish evaluation from deployment) to implement conditional policies that comply under oversight while defecting in deployment-like regimes. We recast alignment evaluation as a problem of information flow under partial observability and show that the divergence between evaluation-time and deployment-time behavior is bounded by the regime information extractable from decision-relevant internal representations. We study regime-blind mechanisms: training-time interventions that restrict access to regime cues through adversarial invariance constraints without assuming complete information erasure. We evaluate this approach across multiple open-weight language models and controlled failure modes including scientific sycophancy, temporal sleeper agents, and data leakage. Regime-blind training reduces regime-conditioned failures without measurable loss of task utility, but exhibits heterogeneous and model-dependent dynamics. Sycophancy shows a sharp representational and behavioral transition at moderate intervention strength, consistent with a stability cliff. In sleeper-style constructions and certain cross-model replications, suppression occurs without a clean collapse of regime decodability and may display non-monotone or oscillatory behavior as invariance pressure increases. These findings indicate that representational invariance is a meaningful but limited control lever. It can raise the cost of regime-conditioned strategies but cannot guarantee elimination or provide architecture-invariant thresholds. Behavioral evaluation should therefore be complemented with white-box diagnostics of regime awareness and internal information flow.
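A minimal sketch of one way such an adversarial invariance constraint could look, using a gradient-reversal probe on hidden states. The probe architecture, the `lam` weighting, and the loss composition are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reversed, scaled gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class RegimeProbe(nn.Module):
    """Adversarial probe trained to decode the regime (evaluation vs. deployment)
    from hidden states; the reversed gradient trains the encoder to defeat it."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, h: torch.Tensor, lam: float) -> torch.Tensor:
        return self.net(GradReverse.apply(h, lam))

def regime_blind_loss(task_logits, task_labels, hidden, regime_labels, probe, lam):
    # The task term preserves utility; the probe term, routed through gradient
    # reversal, pressures the representation toward regime invariance without
    # assuming the regime signal is erased completely.
    task_loss = nn.functional.cross_entropy(task_logits, task_labels)
    invariance_loss = nn.functional.cross_entropy(probe(hidden, lam), regime_labels)
    return task_loss + invariance_loss
```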
Related papers
- On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation [51.56484100374058]
We formalise such information-conditioned interaction patterns as behavioural dependency. This induces a probe-relative notion of behavioural equivalence and a within-policy behavioural distance. The results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations.
arXiv Detail & Related papers (2026-02-24T22:55:21Z)
- ModalImmune: Immunity Driven Unlearning via Self Destructive Training [21.940530514137947]
ModalImmune enforces modality immunity by intentionally collapsing selected modality information during training. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, and curvature-aware gradient masking to stabilize destructive updates. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
arXiv Detail & Related papers (2026-02-18T05:35:32Z)
- When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents [90.05202259420138]
Computer-use agents (CUAs) can deviate from expected outcomes even under benign input contexts. We introduce the first conceptual and methodological framework for unintended CUA behaviors. We propose AutoElicit, an agentic framework that iteratively perturbs benign instructions using CUA execution feedback.
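A generic, hypothetical sketch of such a feedback-guided perturbation loop; `run_agent`, `score_deviation`, and `perturb_ops` are illustrative names, not the paper's API:

```python
import random

def elicit_deviation(instruction, run_agent, score_deviation, perturb_ops,
                     max_iters=20, seed=0):
    """run_agent(text) -> execution trace; score_deviation(trace) -> float
    measuring how far the outcome drifts from the user's intent."""
    rng = random.Random(seed)
    best, best_score = instruction, score_deviation(run_agent(instruction))
    for _ in range(max_iters):
        candidate = rng.choice(perturb_ops)(best)   # e.g. reorder steps, paraphrase
        score = score_deviation(run_agent(candidate))
        if score > best_score:                      # keep perturbations that elicit
            best, best_score = candidate, score     # larger unintended deviations
    return best, best_score
```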
arXiv Detail & Related papers (2026-02-09T03:20:11Z)
- Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z)
- Causal Imitation Learning Under Measurement Error and Distribution Shift [6.038778620145853]
We study offline imitation learning (IL) when part of the decision-relevant state is observed only through noisy measurements. We propose a general framework for IL under measurement error that explicitly models the causal relationships among the variables.
arXiv Detail & Related papers (2026-01-29T18:06:53Z)
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
- Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks [6.573058520271728]
We identify a connection between interpretability and robustness that can be directly leveraged during training. We introduce an attribution-guided refinement framework that transforms Local Interpretable Model-Agnostic Explanations (LIME) into an active training signal.
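A sketch of how an attribution signal can enter training as a penalty. The paper derives its signal from LIME; for a self-contained example we substitute input-gradient attributions, and `suspect_mask` is a hypothetical flag for adversarially suspect features:

```python
import torch
import torch.nn as nn

def attribution_penalty(model, x, y, suspect_mask):
    """Penalize attribution mass on features flagged as adversarially suspect."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    # create_graph=True so the penalty itself is differentiable w.r.t. weights
    (grad,) = torch.autograd.grad(loss, x, create_graph=True)
    return (grad.abs() * suspect_mask).sum()

def refinement_step(model, optimizer, x, y, suspect_mask, beta=0.1):
    optimizer.zero_grad()
    task = nn.functional.cross_entropy(model(x), y)
    (task + beta * attribution_penalty(model, x, y, suspect_mask)).backward()
    optimizer.step()
```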
arXiv Detail & Related papers (2026-01-02T19:36:03Z)
- SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization [26.07264704956791]
We propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression. SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes.
arXiv Detail & Related papers (2025-10-23T04:20:07Z)
- Adversary-Free Counterfactual Prediction via Information-Regularized Representations [8.760019957506719]
We study counterfactual prediction under decoder bias and propose a mathematically grounded, information-theoretic approach. We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised assignment, yielding a stable, provably motivated training criterion. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines.
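The snippet does not reproduce the objective; for orientation, the standard variational device for upper-bounding such an information term replaces the intractable marginal $p(z)$ with a tractable reference $r(z)$ (a generic bound, not necessarily the paper's exact form):

```latex
I(Z;T) \;=\; \mathbb{E}_{t}\!\left[\operatorname{KL}\big(p(z \mid t)\,\|\,p(z)\big)\right]
\;\le\; \mathbb{E}_{t}\!\left[\operatorname{KL}\big(p(z \mid t)\,\|\,r(z)\big)\right],
```

where the gap is exactly $\operatorname{KL}(p(z)\,\|\,r(z)) \ge 0$, so any choice of $r(z)$ yields a valid upper bound.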
arXiv Detail & Related papers (2025-10-17T09:49:04Z)
- Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series. Current approaches use an imputation-then-prediction framework: imputation modules fill in missing values, and forecasting is performed on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce the Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
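An illustrative loss in the spirit of a consistency-regularized bottleneck, not CRIB itself: the two-view construction (forecasts from independently re-imputed views of the same series) and the coefficients are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def crib_style_loss(y_pred, y_pred_alt, y_true, z_mu, z_logvar,
                    beta=1e-3, gamma=0.1):
    """z ~ N(z_mu, exp(z_logvar)) encodes the partially observed series;
    y_pred_alt is the forecast from an independently re-imputed view."""
    forecast = F.mse_loss(y_pred, y_true)                                  # task term
    rate = -0.5 * torch.mean(1 + z_logvar - z_mu.pow(2) - z_logvar.exp())  # IB compression term
    consistency = F.mse_loss(y_pred, y_pred_alt)   # forecasts should agree across imputations
    return forecast + beta * rate + gamma * consistency
```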
arXiv Detail & Related papers (2025-09-27T20:57:48Z)
- Convergence and Generalization of Anti-Regularization for Parametric Models [0.0]
Anti-regularization introduces a reward term with a reversed sign into the loss function. We formalize spectral safety conditions and trust-region constraints. We design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention.
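A minimal sketch of a training step with a reversed-sign L2 term plus the two safeguards named above; `alpha`, `clip_norm`, and `weight_cap` are illustrative values, and the norm-capping projection is one plausible reading of "projection operator":

```python
import torch

def anti_regularized_step(model, optimizer, loss_fn, x, y,
                          alpha=1e-4, clip_norm=1.0, weight_cap=10.0):
    optimizer.zero_grad()
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    loss = loss_fn(model(x), y) - alpha * l2      # reversed sign: a reward, not a penalty
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # trust-region safeguard
    optimizer.step()
    with torch.no_grad():                         # projection: cap each weight tensor's norm
        for p in model.parameters():
            n = p.norm()
            if n > weight_cap:
                p.mul_(weight_cap / n)
```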
arXiv Detail & Related papers (2025-08-24T15:34:17Z)
- Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility of SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- Error-quantified Conformal Inference for Time Series [55.11926160774831]
Uncertainty quantification in time series prediction is challenging due to the temporal dependence and distribution shift on sequential data. We propose Error-quantified Conformal Inference (ECI), which smooths the quantile loss function. ECI can achieve valid miscoverage control and output tighter prediction sets than other baselines.
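A sketch of how smoothing the quantile loss changes an online conformal update: the hard miscoverage indicator in a standard quantile-tracking step is replaced by a sigmoid. The function and parameter names are illustrative, not ECI's exact recursion:

```python
import numpy as np

def smoothed_conformal_update(q, score, alpha, lr=0.05, tau=0.1):
    """One online update of a conformal quantile q against a new
    nonconformity score, targeting miscoverage level alpha."""
    # Smooth stand-in for 1{score > q}: reacts to *how far* a point falls
    # outside the prediction set, not just whether it does.
    err = 1.0 / (1.0 + np.exp(-(score - q) / tau))
    return q + lr * (err - alpha)   # gradient step on the smoothed quantile loss
```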
arXiv Detail & Related papers (2025-02-02T15:02:36Z)
- Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
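The appeal of the Gaussian-channel representation is that mutual information becomes closed-form. For a linear Gaussian channel from actions $u$ to future states $z$ (this specific parametrization is an illustration, not necessarily the paper's):

```latex
z = A u + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0,\Sigma),\quad u \sim \mathcal{N}(0,S)
\;\Longrightarrow\;
I(u;z) = \tfrac{1}{2}\log\det\!\big(I + \Sigma^{-1} A S A^{\top}\big),
```

and empowerment is the capacity of this channel, i.e. the maximum of $I(u;z)$ over admissible action distributions.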
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
- GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
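The correction idea can be stated in one line: estimate the density ratio between the target policy's stationary distribution $d_\pi$ and the data distribution $d_{\mathcal{D}}$, then reweight (a generic statement of distribution correction, not GenDICE's full objective):

```latex
\tau(s,a) = \frac{d_{\pi}(s,a)}{d_{\mathcal{D}}(s,a)},\qquad
\mathbb{E}_{d_{\pi}}\!\left[r(s,a)\right]
= \mathbb{E}_{(s,a)\sim d_{\mathcal{D}}}\!\left[\tau(s,a)\,r(s,a)\right],
\qquad \mathbb{E}_{d_{\mathcal{D}}}\!\left[\tau(s,a)\right] = 1 .
```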
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
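One standard way to make "bounds under unobserved confounding" precise is a marginal sensitivity model in the style of Tan (2006), shown here for a binary action as a generic illustration rather than the paper's exact infinite-horizon construction:

```latex
\Gamma^{-1} \;\le\;
\frac{\pi_0(a \mid s)\,/\,\big(1-\pi_0(a \mid s)\big)}
     {\pi_0(a \mid s, u)\,/\,\big(1-\pi_0(a \mid s, u)\big)}
\;\le\; \Gamma,
```

where $\pi_0(a \mid s)$ is the nominal behavior policy, $u$ the unobserved confounder, and $\Gamma \ge 1$ bounds how strongly confounding can distort the behavior policy's odds; value bounds are then obtained by optimizing over all weightings consistent with this constraint.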
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.