Combining Cost-Constrained Runtime Monitors for AI Safety
- URL: http://arxiv.org/abs/2507.15886v2
- Date: Mon, 04 Aug 2025 03:37:49 GMT
- Title: Combining Cost-Constrained Runtime Monitors for AI Safety
- Authors: Tim Tian Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy,
- Abstract summary: We study how to combine runtime monitors into a single monitoring protocol. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monitoring AIs at runtime can help us detect and stop harmful actions. In this paper, we study how to combine multiple runtime monitors into a single monitoring protocol. The protocol's objective is to maximize the probability of applying a safety intervention on misaligned outputs (i.e., maximize recall). Since running monitors and applying safety interventions are costly, the protocol also needs to adhere to an average-case budget constraint. Taking the monitors' performance and cost as given, we develop an algorithm to find the most efficient protocol. The algorithm exhaustively searches over when and which monitors to call, and allocates safety interventions based on the Neyman-Pearson lemma. By focusing on likelihood ratios and strategically trading off spending on monitors against spending on interventions, we more than double our recall rate compared to a naive baseline in a code review setting. We also show that combining two monitors can Pareto dominate using either monitor alone. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior in cost-sensitive settings.
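The intervention-allocation step described in the abstract can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not the authors' code: it estimates likelihood ratios of monitor scores from labeled calibration data and spends a fixed intervention budget on the highest-ratio outputs first, which is the Neyman-Pearson prescription the abstract refers to. The function name `allocate_interventions` and the toy Beta-distributed scores are invented for this example.

```python
import numpy as np

def allocate_interventions(scores_misaligned, scores_aligned, scores_observed,
                           intervention_budget):
    """Intervene on the outputs whose monitor scores have the highest
    likelihood ratio P(score | misaligned) / P(score | aligned), until the
    intervention budget is exhausted (the Neyman-Pearson prescription)."""
    # Estimate the two score distributions from labeled calibration data with
    # simple histograms; the paper's framework takes the monitors'
    # performance (i.e., these distributions) as given.
    bins = np.linspace(0.0, 1.0, 21)
    p_mis, _ = np.histogram(scores_misaligned, bins=bins, density=True)
    p_ali, _ = np.histogram(scores_aligned, bins=bins, density=True)

    # Look up each observed score's bin and compute its likelihood ratio.
    idx = np.clip(np.digitize(scores_observed, bins) - 1, 0, len(bins) - 2)
    likelihood_ratio = p_mis[idx] / np.maximum(p_ali[idx], 1e-9)

    # Spend the budget on the highest likelihood ratios first.
    order = np.argsort(-likelihood_ratio)
    intervene = np.zeros(len(scores_observed), dtype=bool)
    intervene[order[:intervention_budget]] = True
    return intervene

# Toy usage with invented monitor scores in [0, 1]; higher = more suspicious.
rng = np.random.default_rng(0)
calib_misaligned = rng.beta(5, 2, size=500)  # scores on known-misaligned outputs
calib_aligned = rng.beta(2, 5, size=500)     # scores on known-aligned outputs
new_scores = rng.beta(2, 5, size=100)        # scores on fresh outputs
mask = allocate_interventions(calib_misaligned, calib_aligned, new_scores,
                              intervention_budget=10)
print(f"Intervening on {mask.sum()} of {len(new_scores)} outputs")
```

In the paper's full protocol, the budget is an average-case constraint shared between calling monitors and applying interventions, and the search also chooses when and which monitors to invoke; this sketch shows only the final allocation step.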
Related papers
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed. We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z) - When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense. Recent work on "unfaithfulness" has cast doubt on its reliability. We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z) - Monitoring Robustness and Individual Fairness [7.922558880545528]
We propose runtime monitoring of input-output robustness of deployed, black-box AI models. We show that the monitoring problem can be cast as the fixed-radius nearest neighbor (FRNN) search problem. We present our tool Clemont, which offers a number of lightweight monitors.
arXiv Detail & Related papers (2025-05-31T10:27:54Z) - CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [3.6284577335311563]
Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. A hybrid monitor that combines CoT and action-only monitoring consistently outperforms both across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
arXiv Detail & Related papers (2025-05-29T15:47:36Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception [53.088988929450494]
Collaborative perception (CP) is a promising method for safe connected and autonomous driving. We propose a new paradigm for malicious agent detection that effectively identifies malicious agents at the feature level. We also develop a robust defense method called CP-Guard+, which enhances the margin between the representations of benign and malicious features.
arXiv Detail & Related papers (2025-02-07T12:58:45Z) - Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z) - Can we Defend Against the Unknown? An Empirical Study About Threshold Selection for Neural Network Monitoring [6.8734954619801885]
Runtime monitoring becomes essential to reject unsafe predictions during inference.
Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions.
In real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions.
arXiv Detail & Related papers (2024-05-14T14:32:58Z) - Near-Optimal Multi-Agent Learning for Safe Coverage Control [76.99020416197631]
In multi-agent coverage control problems, agents navigate their environment to reach locations that maximize the coverage of some density.
In this paper, we aim to efficiently learn the density to approximately solve the coverage problem while preserving the agents' safety.
We give first of its kind results: near optimal coverage in finite time while provably guaranteeing safety.
arXiv Detail & Related papers (2022-10-12T16:33:34Z) - Actor-Critic based Improper Reinforcement Learning [61.430513757337486]
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process.
We propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic scheme and a Natural Actor-Critic scheme.
arXiv Detail & Related papers (2022-07-19T05:55:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.