When can we trust untrusted monitoring? A safety case sketch across collusion strategies
- URL: http://arxiv.org/abs/2602.20628v1
- Date: Tue, 24 Feb 2026 07:23:21 GMT
- Title: When can we trust untrusted monitoring? A safety case sketch across collusion strategies
- Authors: Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, Charlie Griffin,
- Abstract summary: We develop methods for rigorously demonstrating safety based on pre-deployment testing.<n>We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points) and combined strategies.
- Score: 43.912646993501106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state our assumptions, and highlight unsolved challenges. We identify conditions under which passive self-recognition could be a more effective collusion strategy than those studied previously. Our work builds towards more robust evaluations of untrusted monitoring.
Related papers
- Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective [31.55000083809067]
We show how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation.<n>We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments.
arXiv Detail & Related papers (2026-02-06T23:20:26Z) - Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI [2.383032749450152]
We develop protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution.<n>We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano.<n>We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS.
arXiv Detail & Related papers (2025-12-01T19:37:08Z) - Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model.<n>We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z) - Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language models (LLM) developers aim for their models to be honest, helpful, and harmless.<n>We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available.<n>We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z) - Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models [93.5740266114488]
Constructive Safety Alignment (CSA) protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results.<n>Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities.<n>We release Oy1, code, and the benchmark to support responsible, user-centered AI.
arXiv Detail & Related papers (2025-09-02T03:04:27Z) - Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed.<n>We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods.<n>Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z) - A sketch of an AI control safety case [3.753791609999324]
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe.<n>We sketch how developers could construct a "control safety case"<n>This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
arXiv Detail & Related papers (2025-01-28T21:52:15Z) - Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats [22.843390303635655]
We investigate whether safety measures remain effective even if large language models intentionally try to bypass them.
We use a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols.
At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
arXiv Detail & Related papers (2024-11-26T18:58:20Z) - Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.