AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
- URL: http://arxiv.org/abs/2601.05752v1
- Date: Fri, 09 Jan 2026 12:09:45 GMT
- Title: AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
- Authors: Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang,
- Abstract summary: AutoMonitor-Bench is the first benchmark to systematically evaluate the reliability of misbehavior monitors across diverse tasks and failure modes. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. We construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruct to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruct to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
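The two metrics above are standard complements of recall and specificity over paired misbehavior/benign instances. A minimal sketch of how they can be computed, assuming binary ground-truth labels and boolean monitor verdicts (the function and variable names are illustrative, not from the paper):

```python
# Sketch of the two AutoMonitor-Bench metrics: Miss Rate (MR) and
# False Alarm Rate (FAR). Labels: 1 = misbehavior, 0 = benign;
# flags: the monitor's boolean verdicts (True = flagged as misbehavior).

def miss_rate(flags, labels):
    """Fraction of true misbehavior samples the monitor failed to flag."""
    misbehaving = [f for f, y in zip(flags, labels) if y == 1]
    return sum(1 for f in misbehaving if not f) / len(misbehaving)

def false_alarm_rate(flags, labels):
    """Fraction of benign samples the monitor incorrectly flagged."""
    benign = [f for f, y in zip(flags, labels) if y == 0]
    return sum(1 for f in benign if f) / len(benign)

labels = [1, 1, 1, 0, 0, 0]
flags = [True, False, True, False, True, False]
print(miss_rate(flags, labels))         # 1 of 3 misbehaviors missed
print(false_alarm_rate(flags, labels))  # 1 of 3 benign samples flagged
```

Lowering the monitor's flagging threshold drives MR down but FAR up, which is the safety-utility trade-off the paper reports.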
Related papers
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory [11.144603446849674]
Chain-of-thought (CoT) monitors are systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest. In this paper, we show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability.
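The quantity in question, I(CoT; output), can be estimated from paired samples of a CoT feature and an output label. A small sketch with illustrative variable names (not from the paper), contrasting a fully dependent pair with an independent one:

```python
import math
from collections import Counter

# Plug-in estimate of mutual information I(X; Y) in bits from paired
# samples, where X is a CoT-trace feature and Y an output label.

def mutual_information(pairs):
    n = len(pairs)
    joint = Counter(pairs)                 # empirical joint distribution
    px = Counter(x for x, _ in pairs)      # marginal over CoT features
    py = Counter(y for _, y in pairs)      # marginal over outputs
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# CoT feature fully determines the output: maximal dependence.
dependent = [("hack", "bad"), ("hack", "bad"), ("clean", "good"), ("clean", "good")]
# CoT feature carries no information about the output.
independent = [("hack", "bad"), ("hack", "good"), ("clean", "bad"), ("clean", "good")]
print(mutual_information(dependent))    # 1.0 bit
print(mutual_information(independent))  # 0.0 bits
```

Zero mutual information, as in the second case, rules out CoT monitoring entirely; the paper's point is that non-zero mutual information alone still does not guarantee a monitor will succeed.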
arXiv Detail & Related papers (2026-02-20T15:50:30Z) - Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision. We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z) - Monitoring Monitorability [7.993120960324396]
We propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
arXiv Detail & Related papers (2025-12-20T10:46:04Z) - Large language models require a new form of oversight: capability-based monitoring [10.382163755118713]
The deployment of large language models (LLMs) in healthcare has been accompanied by scrutiny of their oversight. We propose a new organizing principle for generalist LLM monitoring that is scalable and grounded in how these models are developed and used in practice: capability-based monitoring. We describe considerations for developers, organizational leaders, and professional societies when implementing a capability-based monitoring approach.
arXiv Detail & Related papers (2025-11-05T01:20:28Z) - Reliable Weak-to-Strong Monitoring of LLM Agents [6.922769543581406]
We stress test monitoring systems for detecting covert misbehavior in autonomous agents. We develop a monitor red teaming (MRT) workflow that incorporates varying levels of agent and monitor situational awareness. We release code, data, and logs to spur further research.
arXiv Detail & Related papers (2025-08-26T22:29:31Z) - Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach [1.33134751838052]
This paper introduces a novel fuzzy-based monitor tailored for Machine Learning (ML) perception components. It provides human-interpretable explanations about how different operating conditions affect the reliability of perception components and also functions as a runtime safety monitor.
arXiv Detail & Related papers (2025-05-20T14:22:39Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z) - Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt [80.43623986759691]
We introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD.
The framework equips unsupervised anomaly detection (UAD) with continual learning capability through contrastively-learned prompts.
We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation.
arXiv Detail & Related papers (2024-01-02T03:37:11Z) - Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations [50.37808220291108]
This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations.
We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety.
We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior.
arXiv Detail & Related papers (2021-11-18T23:21:00Z) - Benchmarking Safety Monitors for Image Classifiers with Machine Learning [0.0]
Highly accurate machine learning (ML) image classifiers cannot guarantee that they will not fail in operation.
The use of fault tolerance mechanisms such as safety monitors is a promising direction to keep the system in a safe state.
This paper aims at establishing a baseline framework for benchmarking monitors for ML image classifiers.
arXiv Detail & Related papers (2021-10-04T07:52:23Z) - Digging into Uncertainty in Self-supervised Multi-view Stereo [57.04768354383339]
We propose a novel Uncertainty reduction Multi-view Stereo (UMVS) framework for self-supervised learning.
Our framework achieves the best performance among unsupervised MVS methods, with results competitive with its supervised counterparts.
arXiv Detail & Related papers (2021-08-30T02:53:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.