WebGuard: Building a Generalizable Guardrail for Web Agents
- URL: http://arxiv.org/abs/2507.14293v1
- Date: Fri, 18 Jul 2025 18:06:27 GMT
- Title: WebGuard: Building a Generalizable Guardrail for Web Agents
- Authors: Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su,
- Abstract summary: WebGuard is the first dataset designed to support the assessment of web agent action risks.<n>It contains 4,939 human-annotated actions from 193 websites across 22 diverse domains.
- Score: 59.31116061613742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in lagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
Related papers
- When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent [20.98129117390391]
We present the first systematic study of social engineering attacks against web automation agents.<n>We introduce the AgentBait paradigm, which exploits intrinsic weaknesses in agent execution.<n>We propose SUPERVISOR, a lightweight runtime module that enforces environment intention and consistency alignment.
arXiv Detail & Related papers (2026-01-12T07:10:08Z) - ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks.<n>We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories.<n>We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z) - It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents [52.81924177620322]
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking.<n>Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task.<n>We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z) - Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System [4.864011355064205]
This paper introduces an automated red-teaming framework that generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in large language models (LLMs)<n>Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories.<n> Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns.
arXiv Detail & Related papers (2025-12-21T19:12:44Z) - Death by a Thousand Prompts: Open Model Vulnerability Analysis [0.06213771671016099]
Open-weight models provide researchers and developers with foundations for diverse downstream applications.<n>We tested the safety and security postures of eight open-weight large language models (LLMs) to identify vulnerabilities that may impact subsequent fine-tuning and deployment.<n>Our findings reveal pervasive vulnerabilities across all tested models, with multi-turn attacks achieving success rates between 25.86% and 92.78%.
arXiv Detail & Related papers (2025-11-05T07:22:24Z) - AI Kill Switch for malicious web-based LLM agent [4.144114850905779]
We propose an AI Kill Switch technique that can halt the operation of malicious web-based LLM agents.<n>Key idea is generating defensive prompts that trigger the safety mechanisms of malicious LLM agents.<n>AutoGuard achieves over 80% Defense Success Rate (DSR) across diverse malicious agents.
arXiv Detail & Related papers (2025-09-26T02:20:46Z) - Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking [8.970702398918924]
Large Language Model (LLM) agents exhibit powerful autonomous capabilities across domains such as robotics, virtual assistants, and web automation.<n>Existing rule-based enforcement systems, such as AgentSpec, focus on developing reactive safety rules.<n>We propose Pro2Guard, a proactive runtime enforcement framework grounded in probabilistic reachability analysis.
arXiv Detail & Related papers (2025-08-01T10:24:47Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare [43.75158832964138]
Tiered Agentic Oversight (TAO) is a hierarchical multi-agent framework that enhances AI safety through layered, automated supervision.<n>Inspired by clinical hierarchies (e.g., nurse, physician, specialist), TAO conducts agent routing based on task complexity and agent roles.
arXiv Detail & Related papers (2025-06-14T12:46:10Z) - LlamaFirewall: An open source guardrail system for building secure AI agents [0.5603362829699733]
Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks.<n>Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor.<n>We introduce LlamaFirewall, an open-source security focused guardrail framework.
arXiv Detail & Related papers (2025-05-06T14:34:21Z) - Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization [74.78433600288776]
HKVE (Hierarchical Key-Value Equalization) is an innovative jailbreaking framework that selectively accepts gradient optimization results.<n>We show that HKVE substantially outperforms existing methods by substantially outperforming existing methods by margins of 20.43%, 21.01% and 26.43% respectively.
arXiv Detail & Related papers (2025-03-14T17:57:42Z) - SafeArena: Evaluating the Safety of Autonomous Web Agents [65.49740046281116]
LLM-based agents are becoming increasingly proficient at solving web-based tasks.<n>With this capability comes a greater risk of misuse for malicious purposes.<n>We propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents.
arXiv Detail & Related papers (2025-03-06T20:43:14Z) - AdvAgent: Controllable Blackbox Red-teaming on Web Agents [22.682464365220916]
AdvAgent is a black-box red-teaming framework for attacking web agents.<n>It employs a reinforcement learning-based pipeline to train an adversarial prompter model.<n>With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability.
arXiv Detail & Related papers (2024-10-22T20:18:26Z) - WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.<n>WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.