Related papers: Constitutional Black-Box Monitoring for Scheming in LLM Agents

Constitutional Black-Box Monitoring for Scheming in LLM Agents

URL: http://arxiv.org/abs/2603.00829v1
Date: Sat, 28 Feb 2026 22:31:32 GMT
Title: Constitutional Black-Box Monitoring for Scheming in LLM Agents
Authors: Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn,
Abstract summary: We use language models to examine agent behaviors for suspicious actions.<n>We study constitutional black-box monitors that detect scheming using only externally observable inputs and outputs.<n>We find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization.
Score: 1.4619913143519836
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

Related papers

ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data.<n>We present a trajectory expansion framework Anchor that bootstraps scalable desktop supervision from a small set of verified seed demonstrations.<n>Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage [10.088447487211893]
Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts.<n>This overload creates alert fatigue, leading to overlooked threats and analyst burnout.<n>We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage.
arXiv Detail & Related papers (2025-09-30T22:09:31Z)
DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.<n>Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.<n>MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z)
Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability [0.6117371161379209]
The landscape of Large Language Models (LLMs) shifts rapidly towards dynamic, multi-agent systems.<n>This paper proposes a verification framework that achieves tractable asymmetric effort.<n>We show that targeted verification can be over 12 times faster than full regeneration.
arXiv Detail & Related papers (2025-09-14T03:30:06Z)
Inpainting-Guided Policy Optimization for Diffusion Large Language Models [67.97530437998117]
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs.<n>We explore how inpainting can inform RL algorithm design for dLLMs.
arXiv Detail & Related papers (2025-09-12T16:44:31Z)
Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
Adaptive Stress Testing Black-Box LLM Planners [6.506759042895813]
Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks.<n>But their tendency to hallucinate unsafe and undesired outputs poses risks.<n>We argue that detecting such failures is necessary, especially in safety-critical scenarios.
arXiv Detail & Related papers (2025-05-08T21:50:43Z)
In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.