Basic Legibility Protocols Improve Trusted Monitoring
- URL: http://arxiv.org/abs/2602.10153v1
- Date: Mon, 09 Feb 2026 19:55:31 GMT
- Title: Basic Legibility Protocols Improve Trusted Monitoring
- Authors: Ashwin Sreevatsa, Sebastian Prasanna, Cody Rushing
- Abstract summary: The AI Control research agenda aims to evaluate control protocols that prevent AI systems from taking harmful actions during deployment. Because human oversight is expensive, one approach is trusted monitoring, where weaker, trusted models oversee stronger ones. We introduce legibility protocols, which encourage the untrusted model to take actions that are easier for the monitor to understand.
- Score: 1.8012533382373042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The AI Control research agenda aims to develop control protocols: safety techniques that prevent untrusted AI systems from taking harmful actions during deployment. Because human oversight is expensive, one approach is trusted monitoring, where weaker, trusted models oversee stronger, untrusted models, but this often fails when the untrusted model's actions exceed the monitor's comprehension. We introduce legibility protocols, which encourage the untrusted model to take actions that are easier for a monitor to evaluate. We perform control evaluations in the APPS coding setting, where an adversarial agent attempts to write backdoored code without detection. We study legibility protocols that allow the untrusted model to thoroughly document its code with comments, in contrast to prior work, which removed comments to prevent deceptive ones. We find that: (i) commenting protocols improve safety without sacrificing task performance relative to comment-removal baselines; (ii) commenting disproportionately benefits honest code, which typically has a natural explanation that resolves monitor suspicion, whereas backdoored code frequently lacks an easy justification; (iii) gains from commenting increase with monitor strength, as stronger monitors better distinguish genuine justifications from only superficially plausible ones.
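As a concrete illustration, here is a minimal sketch of how such a trusted-monitoring loop with a commenting legibility protocol could be wired up. The function names, the suspicion scale, and the audit budget are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of trusted monitoring with a commenting legibility
# protocol. All names and the 2% audit budget are illustrative.
from dataclasses import dataclass

@dataclass
class Episode:
    task: str
    code: str
    suspicion: float  # trusted monitor's score in [0, 1]

def untrusted_solve(task: str, require_comments: bool) -> str:
    """Placeholder for the untrusted model. Under the legibility protocol
    it is instructed to document its code thoroughly with comments."""
    raise NotImplementedError

def monitor_suspicion(code: str) -> float:
    """Placeholder for the weaker trusted monitor; higher means more
    likely backdoored."""
    raise NotImplementedError

def run_protocol(tasks: list[str], audit_budget: float = 0.02) -> list[Episode]:
    episodes = []
    for task in tasks:
        code = untrusted_solve(task, require_comments=True)
        episodes.append(Episode(task, code, monitor_suspicion(code)))
    # Spend the scarce human-oversight budget on the most suspicious episodes.
    n_audit = max(1, int(audit_budget * len(episodes)))
    for ep in sorted(episodes, key=lambda e: -e.suspicion)[:n_audit]:
        print(f"AUDIT: {ep.task} (suspicion={ep.suspicion:.2f})")
    return episodes
```

The key move is `require_comments=True`: per the abstract, honest code tends to come with a natural explanation that resolves monitor suspicion, so ranking by suspicion and auditing the top of the list catches backdoors more reliably.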
Related papers
- When can we trust untrusted monitoring? A safety case sketch across collusion strategies [43.912646993501106]
We develop methods for rigorously demonstrating safety based on pre-deployment testing. We also develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies.
arXiv Detail & Related papers (2026-02-24T07:23:21Z)
- Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision. We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time (see the sketch after this entry). We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z)
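For intuition, the sketch below casts tracking-failure detection as a Wald-style sequential probability ratio test over per-frame evidence. The Bernoulli evidence model, the per-frame "match" event, and the error rates are illustrative assumptions, not the paper's formulation:

```python
import math

def sprt_failure_monitor(confidences, p_fail=0.3, p_ok=0.8,
                         alpha=0.01, beta=0.01):
    """confidences: per-frame tracker scores in (0, 1), reduced to a crude
    Bernoulli 'target matched' event. Returns the frame index at which the
    failure hypothesis is accepted, or None if the stream ends first."""
    upper = math.log((1 - beta) / alpha)   # accept H1: tracking has failed
    lower = math.log(beta / (1 - alpha))   # accept H0: tracking is healthy
    llr = 0.0
    for i, c in enumerate(confidences):
        match = c > 0.5
        # Likelihood of this frame's evidence under failure vs. healthy tracking.
        l1 = p_fail if match else 1 - p_fail
        l0 = p_ok if match else 1 - p_ok
        llr += math.log(l1 / l0)
        if llr >= upper:
            return i       # enough accumulated evidence: declare failure
        if llr <= lower:
            llr = 0.0      # confident tracking is fine; reset and keep watching
    return None
```

Resetting after accepting the healthy hypothesis, rather than stopping, is one common variant for monitoring an open-ended stream.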
- Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI [2.383032749450152]
We develop protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution (see the sketch after this entry). We implement our protocols in the APPS coding setting and red-team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) safety improves because monitor performance improves; (iii) factored cognition makes it no harder for capable LLMs to write backdoors in APPS.
arXiv Detail & Related papers (2025-12-01T19:37:08Z)
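The sketch below illustrates the decompose / solve-in-isolation / reassemble shape of such a protocol. All three model calls are hypothetical placeholders:

```python
def trusted_decompose(task: str) -> list[str]:
    raise NotImplementedError  # trusted model plans the child tasks

def untrusted_solve_child(subtask: str) -> str:
    raise NotImplementedError  # untrusted model, fresh isolated context

def trusted_assemble(task: str, subtasks: list[str], parts: list[str]) -> str:
    raise NotImplementedError  # trusted model stitches the parts together

def factored_solve(task: str) -> str:
    subtasks = trusted_decompose(task)
    # Each child task is solved without seeing the others, so a would-be
    # backdoor must be built from pieces the attacker cannot coordinate.
    parts = [untrusted_solve_child(sub) for sub in subtasks]
    # The monitor reviews small, self-contained parts instead of one
    # monolithic solution, which is where the safety gain comes from.
    return trusted_assemble(task, subtasks, parts)
```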
- Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
- Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query. We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term by term (see the sketch after this entry).
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
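The sketch below shows what progressive, term-by-term evaluation with early exit could look like for a polynomial extension of a linear probe. The feature map, margin, and decision rule are illustrative assumptions:

```python
import numpy as np

def progressive_poly_probe(x, coeffs, margin=2.0):
    """x: activation vector; coeffs[k]: weight vector for the degree-(k+1)
    term. Accumulates sum_k coeffs[k] @ (x ** (k + 1)) one term at a time
    and exits once the partial score clears the decision margin."""
    score = 0.0
    for k, w in enumerate(coeffs):
        score += float(w @ (x ** (k + 1)))  # add the next polynomial term
        if abs(score) > margin:             # confident already: skip the
            break                           # remaining higher-order compute
    return score > 0.0  # True = flag the query
```

Easy queries exit after the linear term, so the monitor spends extra compute only on ambiguous activations, which is the dynamic-monitoring idea the summary describes.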
- BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attacks) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
- CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [3.6284577335311563]
Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. A hybrid monitor that combines CoT and action-only monitoring consistently outperforms both across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios (see the sketch after this entry).
arXiv Detail & Related papers (2025-05-29T15:47:36Z)
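One simple way a hybrid could combine the two signals is sketched below; the paper's actual combination rule may differ:

```python
def hybrid_suspicion(cot_score: float, action_score: float) -> float:
    """Both inputs are monitor suspicion scores in [0, 1]. Taking the max
    keeps CoT's sensitivity to subtle sabotage while the action score
    backstops cases where the CoT rationalizes convincingly."""
    return max(cot_score, action_score)

# A persuasive chain-of-thought (low CoT score) cannot mask an action
# the action-only monitor finds suspicious.
assert hybrid_suspicion(0.1, 0.9) == 0.9
```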
- Ctrl-Z: Controlling AI Agents via Resampling [2.505902477669827]
We construct BashBench, a dataset of 257 challenging multi-step system administration tasks. We evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code.
arXiv Detail & Related papers (2025-04-14T16:22:11Z)
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
- CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception [53.088988929450494]
Collaborative perception (CP) is a promising method for safe connected and autonomous driving. We propose a new paradigm for malicious agent detection that effectively identifies malicious agents at the feature level. We also develop a robust defense method called CP-Guard+, which enhances the margin between the representations of benign and malicious features (see the sketch after this entry).
arXiv Detail & Related papers (2025-02-07T12:58:45Z)
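"Enhances the margin" suggests a hinge-style objective; the sketch below is an illustrative stand-in, not the paper's loss, with a hypothetical benign prototype vector:

```python
import numpy as np

def margin_loss(features, labels, prototype, margin=1.0):
    """features: (N, D) agent feature embeddings; labels: (N,) with
    0 = benign, 1 = malicious; prototype: (D,) benign reference vector.
    Pulls benign features toward the prototype and pushes malicious
    features at least `margin` away."""
    d = np.linalg.norm(features - prototype, axis=1)
    benign, malicious = d[labels == 0], d[labels == 1]
    loss = (benign ** 2).mean() if benign.size else 0.0
    if malicious.size:
        # Hinge term: only malicious features inside the margin are penalized.
        loss += (np.maximum(0.0, margin - malicious) ** 2).mean()
    return loss
```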