ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- URL: http://arxiv.org/abs/2410.06703v5
- Date: Thu, 07 Aug 2025 13:02:19 GMT
- Title: ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- Authors: Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov
- Abstract summary: Existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. We introduce ST-WebAgentBench, a suite for evaluating web agent safety and trustworthiness (ST) across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six dimensions (e.g., user consent, robustness).
- Score: 3.09793323158304
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, ST-WebAgentBench (https://sites.google.com/view/st-webagentbench/home) provides an actionable first step toward deploying trustworthy web agents at scale.
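The two headline metrics lend themselves to a short sketch. The record fields and aggregation below are illustrative assumptions rather than the benchmark's actual API; the point is that CuP credits an episode only when the task completes with zero policy violations, while the Risk Ratio reports how often a given ST dimension is breached.

```python
# Illustrative sketch of the CuP and Risk Ratio metrics described above.
# The EpisodeResult fields and aggregation are assumptions, not the
# benchmark's actual API.
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    completed: bool                                            # did the agent finish the task?
    violations: dict[str, int] = field(default_factory=dict)   # ST dimension -> breach count

def completion_rate(results: list[EpisodeResult]) -> float:
    """Nominal task success, ignoring policies."""
    return sum(r.completed for r in results) / len(results)

def completion_under_policy(results: list[EpisodeResult]) -> float:
    """CuP: credit only episodes that finish the task AND violate no policy."""
    ok = sum(r.completed and not any(r.violations.values()) for r in results)
    return ok / len(results)

def risk_ratio(results: list[EpisodeResult], dimension: str) -> float:
    """Fraction of episodes with at least one breach in the given ST dimension."""
    return sum(r.violations.get(dimension, 0) > 0 for r in results) / len(results)

# Example: two of three runs complete, but one completion breaches user consent.
runs = [
    EpisodeResult(completed=True),
    EpisodeResult(completed=True, violations={"user_consent": 1}),
    EpisodeResult(completed=False),
]
print(completion_rate(runs))             # ~0.67 nominal success
print(completion_under_policy(runs))     # ~0.33 CuP
print(risk_ratio(runs, "user_consent"))  # ~0.33
```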
Related papers
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
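The hybrid detection idea can be sketched briefly. Everything below, from the rule set to the `call_llm` helper, is a hypothetical placeholder rather than OpenAgentSafety's real interface.

```python
# Minimal sketch of combining rule-based checks with an LLM-as-judge pass;
# the rule set, judge prompt, and `call_llm` helper are invented placeholders.
import re

RULES = [
    (re.compile(r"rm\s+-rf\s+/"), "destructive shell command"),
    (re.compile(r"curl .*\|\s*(ba)?sh"), "piping remote code into a shell"),
]

def rule_based_flags(action_log: str) -> list[str]:
    """Catch overt unsafe behaviors with deterministic patterns."""
    return [label for pattern, label in RULES if pattern.search(action_log)]

def judge_verdict(action_log: str, call_llm) -> str:
    """Defer subtle cases (e.g., social engineering) to an LLM judge."""
    prompt = (
        "You are a safety judge. Given this agent action log, answer "
        "SAFE or UNSAFE with a one-line reason:\n" + action_log
    )
    return call_llm(prompt)

def evaluate(action_log: str, call_llm) -> dict:
    """Combine both signals, as the framework's description suggests."""
    return {
        "rule_violations": rule_based_flags(action_log),
        "judge_verdict": judge_verdict(action_log, call_llm),
    }
```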
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
- A Systematization of Security Vulnerabilities in Computer Use Agents [1.3560089220432787]
We conduct a systematic threat analysis and testing of real-world CUAs under adversarial conditions. We identify seven classes of risks unique to the CUA paradigm and analyze three concrete exploit scenarios in depth. These case studies reveal deeper architectural flaws across current CUA implementations.
arXiv Detail & Related papers (2025-07-07T19:50:21Z)
- LLM Agents Should Employ Security Principles [60.03651084139836]
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale. We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
arXiv Detail & Related papers (2025-05-29T21:39:08Z)
- AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively. We also apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
- REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites [9.58858258192147]
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites.
We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions.
Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation.
arXiv Detail & Related papers (2025-04-15T18:22:55Z)
- Towards Trustworthy GUI Agents: A Survey [64.6445117343499]
This survey examines the trustworthiness of GUI agents in five critical dimensions.
We identify major challenges such as vulnerability to adversarial attacks and cascading failure modes in sequential decision-making.
As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
arXiv Detail & Related papers (2025-03-30T13:26:00Z)
- SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
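A directed graph of executable functions can be sketched with a plain adjacency map. The function names, dependencies, and enforcement logic below are invented for illustration, not SOPBench's actual task set.

```python
# Illustrative sketch: an SOP as a directed graph of executable functions,
# where each function may only run after its prerequisites. All names and
# the enforcement logic are hypothetical.

def verify_identity(state): state["verified"] = True
def check_balance(state):   state["balance_checked"] = True
def issue_refund(state):    state["refunded"] = True

SOP_GRAPH = {
    "verify_identity": [],                            # no prerequisites
    "check_balance":   ["verify_identity"],           # must verify first
    "issue_refund":    ["verify_identity", "check_balance"],
}
FUNCS = {f.__name__: f for f in (verify_identity, check_balance, issue_refund)}

def agent_call(name: str, state: dict, done: set) -> None:
    """Execute a step only if every predecessor in the SOP graph has run."""
    missing = [p for p in SOP_GRAPH[name] if p not in done]
    if missing:
        raise PermissionError(f"{name} blocked; prerequisites not met: {missing}")
    FUNCS[name](state)
    done.add(name)

# An agent that follows the SOP order succeeds; skipping a step raises.
state, done = {}, set()
agent_call("verify_identity", state, done)
agent_call("check_balance", state, done)
agent_call("issue_refund", state, done)   # allowed: prerequisites satisfied
```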
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
- SafeArena: Evaluating the Safety of Autonomous Web Agents [65.49740046281116]
LLM-based agents are becoming increasingly proficient at solving web-based tasks.
With this capability comes a greater risk of misuse for malicious purposes.
We propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents.
arXiv Detail & Related papers (2025-03-06T20:43:14Z)
- Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a comprehensive benchmark to evaluate the safety of large language model (LLM) agents.
Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.
Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
- SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [42.69984822098671]
Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance. We present SafeAgentBench, the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
arXiv Detail & Related papers (2024-12-17T18:55:58Z)
- Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z)
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [20.796190000442053]
We introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents.
We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications.
Our experiments demonstrate that while baseline agents, based on state-of-the-art LLMs, perform well in executing helpful tasks, they show poor performance in safety tasks.
arXiv Detail & Related papers (2024-10-23T02:51:43Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
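A minimal sketch of the judge-in-the-loop pattern follows; the `call_worker` and `call_judge` helpers and the dict-shaped step format are assumptions for illustration, not the paper's actual interfaces.

```python
# Hypothetical sketch of Agent-as-a-Judge: a judge agent scores each
# intermediate step of a worker agent instead of only the final output.
# `call_worker` and `call_judge` stand in for real agent backends.

def agent_as_a_judge(task: str, call_worker, call_judge, max_steps: int = 10):
    """Run a worker agent and collect per-step verdicts from a judge agent."""
    trajectory, verdicts = [], []
    for _ in range(max_steps):
        step = call_worker(task, trajectory)     # next action / artifact (a dict)
        trajectory.append(step)
        verdict = call_judge(task, trajectory)   # intermediate feedback on progress
        verdicts.append(verdict)
        if step.get("done"):                     # worker signals task completion
            break
    return trajectory, verdicts
```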
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
- Building a Cybersecurity Risk Metamodel for Improved Method and Tool Integration [0.38073142980732994]
We report on our experience applying a model-driven approach to the initial risk analysis step in connection with later security testing.
Our work relies on a common metamodel, which is used to map, synchronise, and ensure information traceability across different tools.
arXiv Detail & Related papers (2024-09-12T10:18:26Z)
- Athena: Safe Autonomous Agents with Verbal Contrastive Learning [3.102303947219617]
Large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks.
In this study, we introduce the Athena framework which leverages the concept of verbal contrastive learning.
The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step.
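The verbal contrastive idea can be read as prompt construction from paired safe/unsafe examples plus a step-level critic; the prompt wording and helper signatures below are invented for illustration, not Athena's actual implementation.

```python
# Hedged sketch of verbal contrastive learning with a step-level critic;
# all prompt text and helpers here are assumptions.

def build_contrastive_prompt(task: str, safe_examples: list[str], unsafe_examples: list[str]) -> str:
    """Show paired safe/unsafe trajectories so the agent can contrast them verbally."""
    lines = [f"Task: {task}", "Examples of SAFE behavior:"]
    lines += [f"  + {e}" for e in safe_examples]
    lines.append("Examples of UNSAFE behavior to avoid:")
    lines += [f"  - {e}" for e in unsafe_examples]
    return "\n".join(lines)

def step_with_critic(agent, critic, prompt: str) -> str:
    """Critique each proposed action; retry once if the critic flags risk."""
    action = agent(prompt)
    if critic(action) == "risky":
        action = agent(
            prompt + f"\nYour previous action was flagged as risky: {action}. "
                     "Propose a safer alternative."
        )
    return action
```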
arXiv Detail & Related papers (2024-08-20T17:21:10Z)
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems [1.079505444748609]
We present our work on building a novel web agent, Agent-E.
Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents.
We show that Agent-E beats other SOTA text and multi-modal web agents in most benchmark categories by 10-30%.
arXiv Detail & Related papers (2024-07-17T21:44:28Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by grounding them in external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.
We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search.
We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- WebSuite: Systematically Evaluating Why Web Agents Fail [2.200477647229223]
We describe WebSuite, the first diagnostic benchmark for generalist web agents.
This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart.
We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent.
arXiv Detail & Related papers (2024-06-01T00:32:26Z)
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z)
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents [50.33549510615024]
This paper presents an Agent-Constitution-based agent framework, TrustAgent, with a focus on improving LLM-based agent safety.
The proposed framework ensures strict adherence to the Agent Constitution through three strategic components: a pre-planning strategy that injects safety knowledge into the model before plan generation, an in-planning strategy that enhances safety during plan generation, and a post-planning strategy that ensures safety via post-plan inspection.
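The three stages can be sketched as a simple pipeline. The constitution format and helper callables below are assumptions in the spirit of the description, not TrustAgent's actual implementation.

```python
# Minimal sketch of a pre-/in-/post-planning safety pipeline; the constitution
# format and the `is_safe`/`inspect` callables are invented placeholders.

def pre_planning(task: str, constitution: list[str]) -> str:
    """Inject safety knowledge into the prompt before any plan is generated."""
    rules = "\n".join(f"- {rule}" for rule in constitution)
    return f"Safety rules you must follow:\n{rules}\n\nTask: {task}"

def in_planning(candidate_steps: list[str], is_safe) -> list[str]:
    """Drop unsafe candidate steps while the plan is being generated."""
    return [step for step in candidate_steps if is_safe(step)]

def post_planning(plan: list[str], inspect) -> bool:
    """Inspect the finished plan; accept it only if every step passes."""
    return all(inspect(step) for step in plan)

# Example wiring with trivial stand-in checks.
prompt = pre_planning("delete old invoices", ["never delete records without user consent"])
plan = in_planning(["ask user for consent", "rm -rf /invoices"],
                   is_safe=lambda s: "rm -rf" not in s)
print(post_planning(plan, inspect=lambda s: True))  # True once unsafe steps are filtered
```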
arXiv Detail & Related papers (2024-02-02T17:26:23Z)
- Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
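One way to read "proxy criticality metrics" is as a thresholded check on how much a state's outcome depends on the chosen action. The Q-value-gap proxy below is an assumed stand-in, not necessarily the paper's exact formulation.

```python
# Hedged illustration of deriving a safety margin from an assumed proxy
# criticality metric (the gap between the best and the average action value).
import numpy as np

def proxy_criticality(q_values: np.ndarray) -> float:
    """How much worse a random action is, on average, than the best one."""
    return float(q_values.max() - q_values.mean())

def exceeds_safety_margin(q_values: np.ndarray, tolerance: float) -> bool:
    """Flag states where deviating from the policy could cost more than `tolerance`."""
    return proxy_criticality(q_values) > tolerance

# Example: a state where one action is far better than the rest is critical.
q = np.array([10.0, 1.0, 0.5, 0.2])
print(exceeds_safety_margin(q, tolerance=5.0))  # True -> oversight warranted here
```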
arXiv Detail & Related papers (2023-07-25T16:49:54Z)