Stress Testing Deliberative Alignment for Anti-Scheming Training
- URL: http://arxiv.org/abs/2509.15541v1
- Date: Fri, 19 Sep 2025 02:49:56 GMT
- Title: Stress Testing Deliberative Alignment for Anti-Scheming Training
- Authors: Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn,
- Abstract summary: Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming"<n> measuring and mitigating scheming requires different strategies than are typically used in ML.<n>We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming.
- Score: 39.16405205129775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming". Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigating scheming requires different strategies than are typically used in ML. We propose that assessing anti-scheming interventions requires at least (1) testing propensity to scheme on far out-of-distribution (OOD) tasks, (2) evaluating whether lack of scheming is driven by situational awareness, and (3) checking for robustness to pre-existing misaligned goals. We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming, and design evaluations for covert actions. We then stress-test deliberative alignment as a case study for anti-scheming. Across 26 OOD evaluations (180+ environments), deliberative alignment reduces covert action rates (OpenAI o3: 13%->0.4%) but does not fully eliminate them. Our mitigation is also able to largely stop agents from pursuing a hidden goal previously trained into the model, but we still find misbehavior after additional red-teaming. We find that models' chain-of-thought (CoT) often demonstrates awareness of being evaluated for alignment, and show causal evidence that this awareness decreases covert behavior, while unawareness increases it. Therefore, we cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness. While we rely on human-legible CoT for training, studying situational awareness, and demonstrating clear evidence of misalignment, our ability to rely on this degrades as models continue to depart from reasoning in standard English. We encourage research into alignment mitigations for scheming and their assessment, especially for the adversarial case of deceptive alignment, which this paper does not address.
Related papers
- Training Agents to Self-Report Misbehavior [6.238288009817414]
We propose self-incrimination training, which trains agents to produce a visible signal when they covertly misbehave.<n>We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively.<n>Self-incrimination significantly reduces the undetected successful attack rate.
arXiv Detail & Related papers (2026-02-25T18:47:17Z) - When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents [50.5814495434565]
This work makes the first effort to define and study misaligned action detection in computer-use agents (CUAs)<n>We identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels.<n>We propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.
arXiv Detail & Related papers (2026-02-09T18:41:15Z) - Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection [4.514361164656055]
We introduce a taxonomy of ten categories of hidden intentions, organised by intent, mechanism, context, and impact.<n>We systematically assess detection methods, including reasoning and non-reasoning LLM judges.<n>We find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions.
arXiv Detail & Related papers (2026-01-26T14:59:17Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - A Granular Study of Safety Pretraining under Model Abliteration [64.24346997570275]
We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions.<n>We issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity.<n>Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration.
arXiv Detail & Related papers (2025-10-03T07:01:45Z) - Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness [27.956005890869267]
We show that Computer-Use Agents (CUAs) consistently exhibit Blind Goal-Directedness (BGD)<n>BGD is a bias to pursue goals regardless of feasibility, safety, reliability, or context.<n>We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns.
arXiv Detail & Related papers (2025-10-02T04:52:15Z) - Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems.<n>Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios.<n>We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z) - Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models [1.6639438555897186]
We finetune reasoning models on malicious behaviors with Chain-of-Thought disabled, and then re-enable CoT at evaluation.<n>We find that reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown.<n>In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied.
arXiv Detail & Related papers (2025-06-16T08:10:04Z) - ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition [52.537021302246664]
Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance)<n>We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes.<n>We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51.
arXiv Detail & Related papers (2025-01-31T20:47:06Z) - Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution.<n>It alerts users for timely correction, preventing adverse outcomes.<n>InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z) - The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks [90.52808174102157]
In safety-critical applications such as medical imaging and autonomous driving, it is imperative to maintain both high adversarial robustness to protect against potential adversarial attacks.
A notable knowledge gap remains concerning the uncertainty inherent in adversarially trained models.
This study investigates the uncertainty of deep learning models by examining the performance of conformal prediction (CP) in the context of standard adversarial attacks.
arXiv Detail & Related papers (2024-05-14T18:05:19Z) - Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an textitAlignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z) - Adversarial robustness via stochastic regularization of neural
activation sensitivity [24.02105949163359]
We suggest a novel defense mechanism that simultaneously addresses both defense goals.
We flatten the gradients of the loss surface, making adversarial examples harder to find.
In addition, we push the decision away from correctly classified inputs by leveraging Jacobian regularization.
arXiv Detail & Related papers (2020-09-23T19:31:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.