Related papers: Reliable Weak-to-Strong Monitoring of LLM Agents

Reliable Weak-to-Strong Monitoring of LLM Agents

URL: http://arxiv.org/abs/2508.19461v1
Date: Tue, 26 Aug 2025 22:29:31 GMT
Title: Reliable Weak-to-Strong Monitoring of LLM Agents
Authors: Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang,
Abstract summary: We stress test monitoring systems for detecting covert misbehavior in autonomous agents.<n>We develop a monitor red teaming (MRT) workflow that incorporates varying levels of agent and monitor situational awareness.<n>We release code, data, and logs to spur further research.
Score: 6.922769543581406
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments -- SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability. On the contrary, providing the monitor with more information about the agent is less helpful than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding, and can enable weaker models to reliably monitor stronger agents -- a weak-to-strong scaling effect. Third, in a human-in-the-loop setting where humans discuss with the LLM monitor to get an updated judgment for the agent's behavior, targeted human oversight is most effective; escalating only pre-flagged cases to human reviewers improved the TPR by approximately 15% at FPR = 0.01. Our work establishes a standard workflow for MRT, highlighting the lack of adversarial robustness for LLMs and humans when monitoring and detecting agent misbehavior. We release code, data, and logs to spur further research.

Related papers

DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties.<n>We study this in a controlled setting by instantiatingDLLM and AR backbones within the same agent workflow.<n>We find thatDLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z)
Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks [12.356708678431183]
Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent.<n>We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a- Proxy attack.<n>Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
arXiv Detail & Related papers (2026-02-04T21:38:38Z)
How does information access affect LLM monitors' ability to detect sabotage? [5.941142438950269]
We study how information access affects LLM monitor performance.<n>We show that contemporary systems often perform better with less information.<n>We find that agents unaware of being monitored can be caught much more easily.
arXiv Detail & Related papers (2026-01-28T23:01:31Z)
AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor [19.39430341586964]
AutoMonitor-Bench is the first benchmark to systematically evaluate the reliability of misbehavior monitors across diverse tasks and failure modes.<n>We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively.<n>We construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors.
arXiv Detail & Related papers (2026-01-09T12:09:45Z)
Practical challenges of control monitoring in frontier AI deployments [15.281070904065613]
Control monitors could play an important role in overseeing highly capable AI agents that we do not fully trust.<n>Previous work has explored control monitoring in simplified settings, but scaling monitoring to real-world deployments introduces additional dynamics.<n>In this paper, we analyse design choices to address these challenges, focusing on three forms of monitoring with different latency-safety trade-offs.
arXiv Detail & Related papers (2025-12-15T15:54:36Z)
AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning [78.5751183537704]
AdvEvo-MARL is a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents.<n>Rather than relying on external guards, AdvEvo-MARL jointly optimize attackers and defenders.
arXiv Detail & Related papers (2025-10-02T02:06:30Z)
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query.<n>We introduce Truncated Polynomials (TPCs), a natural extension of linear probes for dynamic activation monitoring.<n>Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents [60.23552141928126]
PSG-Agent is a personalized and dynamic system for LLM-based agents.<n>First, PSG-Agent creates personalized guardrails by mining the interaction history for stable traits.<n>Second, PSG-Agent implements continuous monitoring across the agent pipeline with specialized guards.
arXiv Detail & Related papers (2025-09-28T03:31:59Z)
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
AgentSight: System-Level Observability for AI Agents Using eBPF [10.37440633887049]
Existing tools observe either an agent's high-level intent (via LLM prompts) or its low-level actions (e.g., system calls) but cannot correlate these two views.<n>We introduce AgentSight, an AgentOps observability framework that bridges this semantic gap using a hybrid approach.<n>AgentSight intercepts TLS-encrypted LLM traffic to extract semantic intent, monitors kernel events to observe system-wide effects, and causally correlates these two streams across process boundaries.
arXiv Detail & Related papers (2025-08-02T01:43:39Z)
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents [8.02267424051267]
Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings.<n>We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks.<n>We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena.
arXiv Detail & Related papers (2025-06-17T15:46:15Z)
SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems [11.497269773189254]
We present a system-level anomaly detection framework tailored for large language model (LLM)-based multi-agent systems (MAS)<n>We propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels.<n>Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning.
arXiv Detail & Related papers (2025-05-30T04:25:19Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems [43.333567687032904]
AgentMonitor is a framework that integrates at the agent level to capture inputs and outputs, transforming them into statistics for training a regression model to predict task performance. It can further apply real-time corrections to address security risks posed by malicious agents, mitigating negative impacts and enhancing MAS security.
arXiv Detail & Related papers (2024-08-27T11:24:38Z)
Learning to Use Tools via Cooperative and Interactive Agents [58.77710337157665]
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. We propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. Our experiments on three datasets show that the LLMs, when equipped with ConAgents, outperform baselines with substantial improvement.
arXiv Detail & Related papers (2024-03-05T15:08:16Z)
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [47.219047422240145]
We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms.
arXiv Detail & Related papers (2024-02-17T06:48:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.