AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
- URL: http://arxiv.org/abs/2601.05752v1
- Date: Fri, 09 Jan 2026 12:09:45 GMT
- Title: AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
- Authors: Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang,
- Abstract summary: AutoMonitor-Bench is the first benchmark to systematically evaluate the reliability of misbehavior monitors across diverse tasks and failure modes. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. We construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruct to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruct to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
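The two metrics above are standard complements of recall and specificity over paired misbehavior/benign instances. A minimal sketch of how they can be computed, assuming binary ground-truth labels and boolean monitor verdicts (the function and variable names are illustrative, not from the paper):

```python
# Sketch of the two AutoMonitor-Bench metrics: Miss Rate (MR) and
# False Alarm Rate (FAR). Labels: 1 = misbehavior, 0 = benign;
# flags: the monitor's boolean verdicts (True = flagged as misbehavior).

def miss_rate(flags, labels):
    """Fraction of true misbehavior samples the monitor failed to flag."""
    misbehaving = [f for f, y in zip(flags, labels) if y == 1]
    return sum(1 for f in misbehaving if not f) / len(misbehaving)

def false_alarm_rate(flags, labels):
    """Fraction of benign samples the monitor incorrectly flagged."""
    benign = [f for f, y in zip(flags, labels) if y == 0]
    return sum(1 for f in benign if f) / len(benign)

labels = [1, 1, 1, 0, 0, 0]
flags = [True, False, True, False, True, False]
print(miss_rate(flags, labels))         # 1 of 3 misbehaviors missed
print(false_alarm_rate(flags, labels))  # 1 of 3 benign samples flagged
```

Lowering the monitor's flagging threshold drives MR down but FAR up, which is the safety-utility trade-off the paper reports.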
Related papers
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory [11.144603446849674]
Chain-of-thought (CoT) monitors are systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest. In this paper, we show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability.
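The quantity in question, I(CoT; output), can be estimated from paired samples of a CoT feature and an output label. A small sketch with illustrative variable names (not from the paper), contrasting a fully dependent pair with an independent one:

```python
import math
from collections import Counter

# Plug-in estimate of mutual information I(X; Y) in bits from paired
# samples, where X is a CoT-trace feature and Y an output label.

def mutual_information(pairs):
    n = len(pairs)
    joint = Counter(pairs)                 # empirical joint distribution
    px = Counter(x for x, _ in pairs)      # marginal over CoT features
    py = Counter(y for _, y in pairs)      # marginal over outputs
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# CoT feature fully determines the output: maximal dependence.
dependent = [("hack", "bad"), ("hack", "bad"), ("clean", "good"), ("clean", "good")]
# CoT feature carries no information about the output.
independent = [("hack", "bad"), ("hack", "good"), ("clean", "bad"), ("clean", "good")]
print(mutual_information(dependent))    # 1.0 bit
print(mutual_information(independent))  # 0.0 bits
```

Zero mutual information, as in the second case, rules out CoT monitoring entirely; the paper's point is that non-zero mutual information alone still does not guarantee a monitor will succeed.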
arXiv Detail & Related papers (2026-02-20T15:50:30Z) - Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision. We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z) - Monitoring Monitorability [7.993120960324396]
We propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
arXiv Detail & Related papers (2025-12-20T10:46:04Z) - Large language models require a new form of oversight: capability-based monitoring [10.382163755118713]
The deployment of large language models (LLMs) in healthcare has been accompanied by scrutiny of their oversight. We propose a new organizing principle for generalist LLM monitoring that is scalable and grounded in how these models are developed and used in practice: capability-based monitoring. We describe considerations for developers, organizational leaders, and professional societies when implementing a capability-based monitoring approach.
arXiv Detail & Related papers (2025-11-05T01:20:28Z) - Reliable Weak-to-Strong Monitoring of LLM Agents [6.922769543581406]
We stress test monitoring systems for detecting covert misbehavior in autonomous agents. We develop a monitor red teaming (MRT) workflow that incorporates varying levels of agent and monitor situational awareness. We release code, data, and logs to spur further research.
arXiv Detail & Related papers (2025-08-26T22:29:31Z) - Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach [1.33134751838052]
This paper introduces a novel fuzzy-based monitor tailored for Machine Learning (ML) perception components. It provides human-interpretable explanations about how different operating conditions affect the reliability of perception components and also functions as a runtime safety monitor.
arXiv Detail & Related papers (2025-05-20T14:22:39Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z) - Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt [80.43623986759691]
We introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD.
The framework equips unsupervised anomaly detection (UAD) with continual learning capability through contrastively-learned prompts.
We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation.
arXiv Detail & Related papers (2024-01-02T03:37:11Z) - Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations [50.37808220291108]
This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations.
We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety.
We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior.
arXiv Detail & Related papers (2021-11-18T23:21:00Z) - Benchmarking Safety Monitors for Image Classifiers with Machine Learning [0.0]
Highly accurate machine learning (ML) image classifiers cannot guarantee that they will not fail in operation.
The use of fault tolerance mechanisms such as safety monitors is a promising direction to keep the system in a safe state.
This paper aims at establishing a baseline framework for benchmarking monitors for ML image classifiers.
arXiv Detail & Related papers (2021-10-04T07:52:23Z) - Digging into Uncertainty in Self-supervised Multi-view Stereo [57.04768354383339]
We propose a novel Uncertainty reduction Multi-view Stereo (UMVS) framework for self-supervised learning.
Our framework achieves the best performance among unsupervised MVS methods, with results competitive with its supervised counterparts.
arXiv Detail & Related papers (2021-08-30T02:53:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.