Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI
- URL: http://arxiv.org/abs/2512.02157v1
- Date: Mon, 01 Dec 2025 19:37:08 GMT
- Title: Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI
- Authors: Aaron Sandoval, Cody Rushing
- Abstract summary: We develop protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) safety improves because monitor performance improves; (iii) factored cognition makes it no harder for capable LLMs to write backdoors in APPS.
- Score: 2.383032749450152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of AI Control seeks to develop robust control protocols: deployment safeguards for untrusted AI that may be intentionally subversive. However, existing protocols that rely on weaker monitors to detect unsafe behavior often fail on complex tasks beyond the monitor's comprehension. We develop control protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. These protocols may improve safety by several means, such as by simplifying the context for monitors, or by obscuring vulnerabilities in the environment. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) safety improves because monitor performance improves; (iii) factored cognition makes it no harder for capable LLMs to write backdoors in APPS. While our protocols show low usefulness in APPS, they hold promise for more complex tasks.
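The decompose/solve/monitor/reassemble loop the abstract describes can be sketched as a minimal protocol skeleton. All function names, the stub model behaviors, and the audit threshold below are illustrative assumptions, not the paper's actual implementation (which targets the APPS coding setting with real LLMs):

```python
# Hypothetical sketch of a factored-cognition control protocol.
# Model calls are stubbed; in the paper these are LLM invocations.

def trusted_decompose(task):
    """Trusted model splits a task into isolated child tasks (stub)."""
    return [f"{task}: step {i}" for i in range(1, 4)]

def untrusted_solve(child_task):
    """Untrusted model solves one child task in isolation (stub)."""
    return f"solution({child_task})"

def monitor_suspicion(child_task, solution):
    """Trusted monitor scores a child solution; higher = more suspicious (stub)."""
    return 0.1  # benign placeholder score

def factored_protocol(task, audit_threshold=0.5):
    """Run the decompose -> solve -> monitor -> reassemble loop.

    Returns the reassembled solution, or None if any child solution
    exceeds the suspicion threshold and is flagged for human audit.
    """
    solutions = []
    for child in trusted_decompose(task):
        sol = untrusted_solve(child)
        if monitor_suspicion(child, sol) > audit_threshold:
            return None  # escalate to audit instead of deploying
        solutions.append(sol)
    return " + ".join(solutions)  # reassemble into a full solution

result = factored_protocol("sort a list")
```

The key safety property the paper studies is that each monitor call sees only one child task's narrow context, rather than the full solution.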
Related papers
- When can we trust untrusted monitoring? A safety case sketch across collusion strategies [43.912646993501106]
We develop methods for rigorously demonstrating safety based on pre-deployment testing. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies.
arXiv Detail & Related papers (2026-02-24T07:23:21Z) - Basic Legibility Protocols Improve Trusted Monitoring [1.8012533382373042]
The AI Control research agenda aims to evaluate control protocols that prevent AI systems from taking harmful actions during deployment. Because human oversight is expensive, one approach is trusted monitoring, where weaker, trusted models oversee stronger ones. We introduce legibility protocols, which encourage the untrusted model to take actions that are easier for the monitor to understand.
arXiv Detail & Related papers (2026-02-09T19:55:31Z) - Password-Activated Shutdown Protocols for Misaligned Frontier Agents [6.9054075158485455]
We introduce password-activated shutdown protocols (PAS protocols). PAS protocols are methods for designing frontier agents to implement a safe shutdown protocol when given a password. We conduct experiments in a code-generation setting, finding that there are effective strategies for the red team.
arXiv Detail & Related papers (2025-11-29T14:49:53Z) - Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z) - AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning [78.5751183537704]
AdvEvo-MARL is a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers and defenders.
arXiv Detail & Related papers (2025-10-02T02:06:30Z) - Combining Cost-Constrained Runtime Monitors for AI Safety [0.25655761752240497]
We study how to efficiently combine multiple runtime monitors into a single monitoring protocol. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior.
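One simple way to combine monitors under a cost constraint, sketched below purely for illustration, is to run cheap monitors first and escalate to costlier ones only while budget remains and the verdict is inconclusive. The function names, thresholds, and monitor stubs are assumptions, not the cited paper's actual method:

```python
# Illustrative sketch: sequential escalation across monitors with a cost budget.
# monitors: list of (cost, score_fn) pairs, sorted cheapest first.
# score_fn returns a suspicion score in [0, 1].

def combined_monitor(output, monitors, budget):
    """Return a suspicion score for `output`, spending at most `budget`.

    Stops early once a monitor gives a confident verdict (clearly
    benign or clearly suspicious), or when the budget is exhausted.
    """
    spent, score = 0.0, 0.0
    for cost, score_fn in monitors:
        if spent + cost > budget:
            break  # can't afford the next monitor
        spent += cost
        score = score_fn(output)
        if score < 0.2 or score > 0.8:  # confident verdict: stop early
            break
    return score

# Stub monitors: a cheap but inconclusive one, and a costly accurate one.
cheap = (1.0, lambda out: 0.5)
strong = (5.0, lambda out: 0.9)
score = combined_monitor("some model output", [cheap, strong], budget=10.0)
```

Here the cheap monitor's 0.5 score is inconclusive, so the protocol pays for the stronger monitor and returns its 0.9 verdict; with `budget=0.5` neither monitor could run.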
arXiv Detail & Related papers (2025-07-19T04:28:40Z) - Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed. We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z) - Ctrl-Z: Controlling AI Agents via Resampling [2.505902477669827]
We construct BashBench, a dataset of 257 challenging multi-step system administration tasks. We evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code.
arXiv Detail & Related papers (2025-04-14T16:22:11Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception [53.088988929450494]
Collaborative perception (CP) is a promising method for safe connected and autonomous driving. We propose a new paradigm for malicious agent detection that effectively identifies malicious agents at the feature level. We also develop a robust defense method called CP-Guard+, which enhances the margin between the representations of benign and malicious features.
arXiv Detail & Related papers (2025-02-07T12:58:45Z) - A Survey and Comparative Analysis of Security Properties of CAN Authentication Protocols [92.81385447582882]
The Controller Area Network (CAN) bus lacks built-in security, leaving in-vehicle communications inherently non-secure.
This paper reviews and compares the 15 most prominent authentication protocols for the CAN bus.
We evaluate protocols based on essential operational criteria that contribute to ease of implementation.
arXiv Detail & Related papers (2024-01-19T14:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.