WebGuard: Building a Generalizable Guardrail for Web Agents
- URL: http://arxiv.org/abs/2507.14293v1
- Date: Fri, 18 Jul 2025 18:06:27 GMT
- Title: WebGuard: Building a Generalizable Guardrail for Web Agents
- Authors: Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su
- Abstract summary: WebGuard is the first dataset designed to support the assessment of web agent action risks. It contains 4,939 human-annotated actions from 193 websites across 22 diverse domains.
- Score: 59.31116061613742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
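The abstract describes a three-tier risk schema (SAFE, LOW, HIGH) used by a guardrail model to screen state-changing web-agent actions. The sketch below illustrates how such a schema might gate an agent's proposed actions; the function names, policy (allow SAFE, ask the user for LOW, block HIGH), and injected `classify`/`confirm` callables are assumptions for illustration, not an API from the paper.

```python
from enum import Enum
from typing import Callable

# Risk tiers from the WebGuard schema described in the abstract.
class Risk(Enum):
    SAFE = "SAFE"
    LOW = "LOW"
    HIGH = "HIGH"

def guard_action(
    action_description: str,
    page_context: str,
    classify: Callable[[str, str], Risk],
    confirm: Callable[[str], bool],
) -> bool:
    """Decide whether a proposed state-changing action may proceed.

    `classify` stands in for a guardrail model (e.g. a fine-tuned
    Qwen2.5VL-7B as mentioned in the abstract); `confirm` asks the
    human user. Both are injected here because the paper does not
    specify an interface; the gating policy is likewise hypothetical.
    """
    risk = classify(action_description, page_context)
    if risk is Risk.SAFE:
        return True                          # proceed without interruption
    if risk is Risk.LOW:
        return confirm(action_description)   # defer to the user
    return False                             # block HIGH-risk actions outright
```

A usage example would wire `classify` to the guardrail model's prediction over the action description and page screenshot/DOM, and `confirm` to a UI prompt; the point is that HIGH-risk actions are never executed autonomously.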
Related papers
- Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking [8.970702398918924]
Large Language Model (LLM) agents exhibit powerful autonomous capabilities across domains such as robotics, virtual assistants, and web automation. Existing rule-based enforcement systems, such as AgentSpec, focus on developing reactive safety rules. We propose Pro2Guard, a proactive runtime enforcement framework grounded in probabilistic reachability analysis.
arXiv Detail & Related papers (2025-08-01T10:24:47Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare [43.75158832964138]
Tiered Agentic Oversight (TAO) is a hierarchical multi-agent framework that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies (e.g., nurse, physician, specialist), TAO conducts agent routing based on task complexity and agent roles.
arXiv Detail & Related papers (2025-06-14T12:46:10Z) - LlamaFirewall: An open source guardrail system for building secure AI agents [0.5603362829699733]
Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor. We introduce LlamaFirewall, an open-source, security-focused guardrail framework.
arXiv Detail & Related papers (2025-05-06T14:34:21Z) - Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization [74.78433600288776]
HKVE (Hierarchical Key-Value Equalization) is an innovative jailbreaking framework that selectively accepts gradient optimization results. We show that HKVE substantially outperforms existing methods by margins of 20.43%, 21.01%, and 26.43%, respectively.
arXiv Detail & Related papers (2025-03-14T17:57:42Z) - SafeArena: Evaluating the Safety of Autonomous Web Agents [65.49740046281116]
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes. We propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents.
arXiv Detail & Related papers (2025-03-06T20:43:14Z) - AdvAgent: Controllable Blackbox Red-teaming on Web Agents [22.682464365220916]
AdvAgent is a black-box red-teaming framework for attacking web agents. It employs a reinforcement learning-based pipeline to train an adversarial prompter model. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability.
arXiv Detail & Related papers (2024-10-22T20:18:26Z) - WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety. WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)