Related papers: Training Agents to Self-Report Misbehavior

Training Agents to Self-Report Misbehavior

URL: http://arxiv.org/abs/2602.22303v1
Date: Wed, 25 Feb 2026 18:47:17 GMT
Title: Training Agents to Self-Report Misbehavior
Authors: Bruce W. Lee, Chen Yueh-Han, Tomek Korbak,
Abstract summary: We propose self-incrimination training, which trains agents to produce a visible signal when they covertly misbehave.<n>We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively.<n>Self-incrimination significantly reduces the undetected successful attack rate.
Score: 6.238288009817414
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks regardless of how suspicious the misbehavior appears externally. The trained behavior persists under adversarial prompt optimization and generalizes to settings where agents pursue misaligned goals themselves rather than being instructed to misbehave. Our results suggest self-incrimination offers a viable path for reducing frontier misalignment risk, one that neither assumes misbehavior can be prevented nor that it can be reliably classified from the outside.

Related papers

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs [0.0]
Openweight Large Language Models (LLMs) have democratized agentic AI, yet finetuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance.<n>This creates a risk where third-party models are incorporated without strong behavioral guarantees.<n>We show that poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption.
arXiv Detail & Related papers (2026-03-02T22:01:08Z)
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents [50.5814495434565]
This work makes the first effort to define and study misaligned action detection in computer-use agents (CUAs)<n>We identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels.<n>We propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.
arXiv Detail & Related papers (2026-02-09T18:41:15Z)
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents [90.05202259420138]
Unintended computer-use agents can deviate from expected outcomes even under benign input contexts.<n>We introduce the first conceptual and methodological framework for unintended CUA behaviors.<n>We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback.
arXiv Detail & Related papers (2026-02-09T03:20:11Z)
Co-Evolving Agents: Learning from Failures as Hard Negatives [38.61683607205988]
Recent work has explored self-improving agents that autonomously generate, refine, and re-train on their own trajectories.<n>We propose a co-evolving agents framework in which a target agent improves jointly with an auxiliary failure agent.
arXiv Detail & Related papers (2025-11-27T09:30:33Z)
AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning [78.5751183537704]
AdvEvo-MARL is a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents.<n>Rather than relying on external guards, AdvEvo-MARL jointly optimize attackers and defenders.
arXiv Detail & Related papers (2025-10-02T02:06:30Z)
Stress Testing Deliberative Alignment for Anti-Scheming Training [39.16405205129775]
Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming"<n> measuring and mitigating scheming requires different strategies than are typically used in ML.<n>We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming.
arXiv Detail & Related papers (2025-09-19T02:49:56Z)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
Large Language Model (LLM) agents become more widespread, associated misalignment risks increase.<n>In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer.<n>We introduce a misalignment propensity benchmark, textscAgentMisalignment, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios.
arXiv Detail & Related papers (2025-06-04T14:46:47Z)
Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs)<n>Our framework finetunes only adapter modules and text embedding layers under instruction tuning.<n>Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that ours reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
Adversarial Inception Backdoor Attacks against Reinforcement Learning [16.350898218047405]
Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks.<n>We propose a new class of backdoor attacks against DRL which are the first to achieve state of the art performance under strict reward constraints.
arXiv Detail & Related papers (2024-10-17T19:50:28Z)
Preemptive Detection and Correction of Misaligned Actions in LLM Agents [58.39520480675366]
InferAct is a novel approach to detect misaligned actions before execution.<n>It alerts users for timely correction, preventing adverse outcomes.<n>InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.