Related papers: Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection

Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection

URL: http://arxiv.org/abs/2510.03485v1
Date: Fri, 03 Oct 2025 20:03:19 GMT
Title: Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection
Authors: Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, Muhao Chen,
Abstract summary: PolicyGuardBench is a benchmark of about 60k examples for detecting policy violations in agent trajectories.<n>PolicyGuard-4B is a lightweight guardrail model that delivers strong detection accuracy across all tasks.<n>Together, PolicyGuardBench and PolicyGuard-4B provide the first comprehensive framework for studying policy compliance in web agent trajectories.
Score: 25.53228630260007
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Autonomous web agents need to operate under externally imposed or human-specified policies while generating long-horizon trajectories. However, little work has examined whether these trajectories comply with such policies, or whether policy violations persist across different contexts such as domains (e.g., shopping or coding websites) and subdomains (e.g., product search and order management in shopping). To address this gap, we introduce PolicyGuardBench, a benchmark of about 60k examples for detecting policy violations in agent trajectories. From diverse agent runs, we generate a broad set of policies and create both within subdomain and cross subdomain pairings with violation labels. In addition to full-trajectory evaluation, PolicyGuardBench also includes a prefix-based violation detection task where models must anticipate policy violations from truncated trajectory prefixes rather than complete sequences. Using this dataset, we train PolicyGuard-4B, a lightweight guardrail model that delivers strong detection accuracy across all tasks while keeping inference efficient. Notably, PolicyGuard-4B generalizes across domains and preserves high accuracy on unseen settings. Together, PolicyGuardBench and PolicyGuard-4B provide the first comprehensive framework for studying policy compliance in web agent trajectories, and show that accurate and generalizable guardrails are feasible at small scales.

Related papers

QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems [22.833567409552074]
textscQuadSentinel is a four-agent guard that compiles safety policies into machine-checkable rules.<n>textscQuadSentinel improves guardrail accuracy and rule recall while reducing false positives.
arXiv Detail & Related papers (2025-12-18T07:58:40Z)
Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs [21.5603664964501]
We propose a training-free and efficient method that treats policy violation detection as an out-of-distribution detection problem.<n>Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance.<n>On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models.
arXiv Detail & Related papers (2025-12-03T17:23:39Z)
Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents [0.19336815376402716]
We introduce a regulatory machine learning framework that converts unstructured design artifacts (like PRDs, TDDs, and code) into verifiable runtime guardrails.<n>Our Policy as Prompt method reads these documents and risk controls to build a source-linked policy tree.<n>System is built to enforce least privilege and data minimization.
arXiv Detail & Related papers (2025-09-28T17:36:52Z)
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
Effective Red-Teaming of Policy-Adherent Agents [10.522087614181745]
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules.<n>We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit.<n>We present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario.
arXiv Detail & Related papers (2025-06-11T10:59:47Z)
Learning Deterministic Policies with Policy Gradients in Constrained Markov Decision Processes [59.27926064817273]
We introduce an exploration-agnostic algorithm, called C-PG, which enjoys global last-iterate convergence guarantees under domination assumptions.<n>We empirically validate both the action-based (C-PGAE) and parameter-based (C-PGPE) variants of C-PG on constrained control tasks.
arXiv Detail & Related papers (2025-06-06T10:29:05Z)
Dense Policy: Bidirectional Autoregressive Learning of Actions [51.60428100831717]
This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction.<n>It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner.<n>Experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies.
arXiv Detail & Related papers (2025-03-17T14:28:08Z)
Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems. Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process. This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions. We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment. Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z)
POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning [30.834631947104498]
We present POLTER - a method to regularize the pretraining that can be applied to any URL algorithm. We evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case.
arXiv Detail & Related papers (2022-05-23T14:42:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.