Related papers: When can we trust untrusted monitoring? A safety case sketch across collusion strategies

When can we trust untrusted monitoring? A safety case sketch across collusion strategies

URL: http://arxiv.org/abs/2602.20628v1
Date: Tue, 24 Feb 2026 07:23:21 GMT
Title: When can we trust untrusted monitoring? A safety case sketch across collusion strategies
Authors: Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, Charlie Griffin,
Abstract summary: We develop methods for rigorously demonstrating safety based on pre-deployment testing.<n>We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points) and combined strategies.
Score: 43.912646993501106
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state our assumptions, and highlight unsolved challenges. We identify conditions under which passive self-recognition could be a more effective collusion strategy than those studied previously. Our work builds towards more robust evaluations of untrusted monitoring.

Related papers

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective [31.55000083809067]
We show how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation.<n>We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments.
arXiv Detail & Related papers (2026-02-06T23:20:26Z)
Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI [2.383032749450152]
We develop protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution.<n>We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano.<n>We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS.
arXiv Detail & Related papers (2025-12-01T19:37:08Z)
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model.<n>We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language models (LLM) developers aim for their models to be honest, helpful, and harmless.<n>We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available.<n>We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z)
Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models [93.5740266114488]
Constructive Safety Alignment (CSA) protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results.<n>Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities.<n>We release Oy1, code, and the benchmark to support responsible, user-centered AI.
arXiv Detail & Related papers (2025-09-02T03:04:27Z)
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed.<n>We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods.<n>Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z)
A sketch of an AI control safety case [3.753791609999324]
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe.<n>We sketch how developers could construct a "control safety case"<n>This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
arXiv Detail & Related papers (2025-01-28T21:52:15Z)
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats [22.843390303635655]
We investigate whether safety measures remain effective even if large language models intentionally try to bypass them. We use a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
arXiv Detail & Related papers (2024-11-26T18:58:20Z)
Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.