ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- URL: http://arxiv.org/abs/2410.06703v2
- Date: Thu, 10 Oct 2024 09:38:21 GMT
- Authors: Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov
- Abstract summary: We present ST-WebAgentBench, a new benchmark specifically designed to evaluate the safety and trustworthiness of web agents in enterprise contexts.
This benchmark is grounded in a detailed framework that defines safe and trustworthy (ST) agent behavior.
Our evaluation reveals that current SOTA agents struggle with policy adherence and cannot yet be relied upon for critical business applications.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in LLM-based web agents have introduced novel architectures and benchmarks showcasing progress in autonomous web navigation and interaction. However, most existing benchmarks prioritize effectiveness and accuracy, overlooking crucial factors like safety and trustworthiness, which are essential for deploying web agents in enterprise settings. The risks of unsafe web agent behavior, such as accidentally deleting user accounts or performing unintended actions in critical business operations, pose significant barriers to widespread adoption. In this paper, we present ST-WebAgentBench, a new online benchmark specifically designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. This benchmark is grounded in a detailed framework that defines safe and trustworthy (ST) agent behavior, outlines how ST policies should be structured, and introduces the Completion under Policies (CuP) metric to assess agent performance. Our evaluation reveals that current SOTA agents struggle with policy adherence and cannot yet be relied upon for critical business applications. Additionally, we propose architectural principles aimed at improving policy awareness and compliance in web agents. We open-source this benchmark and invite the community to contribute, with the goal of fostering a new generation of safer, more trustworthy AI agents. All code, data, environment reproduction resources, and video demonstrations are available at https://sites.google.com/view/st-webagentbench/home.
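The metric's name suggests a simple scoring rule: a task counts as solved only if it is completed without breaching any attached policy. A minimal sketch under that assumption (the `TaskResult` schema is hypothetical; the paper's exact formulation may differ):

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """Outcome of one benchmark episode (hypothetical schema)."""
    completed: bool                                  # did the agent finish the task?
    violations: list = field(default_factory=list)   # policy violations observed

def completion_under_policies(results):
    """Fraction of tasks completed with zero policy violations.

    A task that succeeds but breaches any policy scores 0, which is
    the idea behind a Completion-under-Policies style metric."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if r.completed and not r.violations)
    return ok / len(results)

# Example: 2 of 3 tasks completed, but one completion violated a policy.
results = [
    TaskResult(completed=True),
    TaskResult(completed=True, violations=["deleted_user_account"]),
    TaskResult(completed=False),
]
print(completion_under_policies(results))  # 0.333...
```

Under this rule, an agent that completes every task but violates a policy each time scores zero, exposing the gap between raw success rate and policy-compliant success that the benchmark targets.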
Related papers
- Evaluating Cultural and Social Awareness of LLM Web Agents
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments than in agent settings.
arXiv Detail & Related papers (2024-10-30T17:35:44Z)
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
We introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents.
We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications.
Our experiments demonstrate that while baseline agents, based on state-of-the-art LLMs, perform well in executing helpful tasks, they show poor performance in safety tasks.
arXiv Detail & Related papers (2024-10-23T02:51:43Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Agent-as-a-Judge: Evaluate Agents with Agents
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
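A rough sketch of the control flow this implies (a judge that sees the worker's intermediate steps and scores each requirement), with toy callables standing in for the paper's actual agentic components:

```python
from typing import Callable, Iterable

def agent_as_a_judge(
    worker_steps: Iterable[str],
    judge: Callable[[str, list[str]], bool],
    requirements: list[str],
) -> dict[str, bool]:
    """Evaluate a worker agent's trajectory with a judge agent (illustrative).

    Unlike LLM-as-a-Judge, the judge observes the intermediate steps of the
    whole task-solving process, not just the final output."""
    trajectory: list[str] = []
    for step in worker_steps:        # intermediate artifacts from the worker
        trajectory.append(step)
    # The judge issues one verdict per requirement over the full trajectory.
    return {req: judge(req, trajectory) for req in requirements}

# Toy usage: a keyword-matching "judge" standing in for an agentic evaluator.
steps = ["wrote unit tests", "implemented parser", "ran tests: all pass"]
judge = lambda req, traj: any(req in s for s in traj)
print(agent_as_a_judge(steps, judge, ["tests", "parser"]))
# {'tests': True, 'parser': True}
```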
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
- Building a Cybersecurity Risk Metamodel for Improved Method and Tool Integration
We report on our experience applying a model-driven approach to the initial risk analysis step in connection with later security testing.
Our work relies on a common metamodel that is used to map, synchronise, and ensure information traceability across different tools.
arXiv Detail & Related papers (2024-09-12T10:18:26Z)
- Athena: Safe Autonomous Agents with Verbal Contrastive Learning
Large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks.
In this study, we introduce the Athena framework, which leverages the concept of verbal contrastive learning, using past safe and unsafe trajectories as in-context examples.
The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step.
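A rough sketch of the two ideas, contrastive in-context examples plus a step-level critic, with all names illustrative rather than Athena's actual API:

```python
def build_contrastive_prompt(task, safe_trajs, unsafe_trajs):
    """Pair past safe and unsafe trajectories as in-context examples
    (a sketch of verbal contrastive learning, not Athena's actual code)."""
    parts = ["Imitate the SAFE examples; never imitate the UNSAFE ones.\n"]
    parts += [f"SAFE:\n{t}\n" for t in safe_trajs]
    parts += [f"UNSAFE:\n{t}\n" for t in unsafe_trajs]
    parts.append(f"Task: {task}")
    return "\n".join(parts)

def act_with_critic(propose, critique, prompt):
    """Step-level critiquing loop: a critic may veto and force a revision."""
    action = propose(prompt)
    verdict = critique(action)               # e.g. "safe" or a risk critique
    if verdict != "safe":
        action = propose(prompt + f"\nCritique of last attempt: {verdict}")
    return action

# Toy usage of the prompt builder.
prompt = build_contrastive_prompt(
    "Clean up temporary files",
    safe_trajs=["listed files, then deleted only /tmp/cache/*"],
    unsafe_trajs=["ran rm -rf on the user's home directory"],
)
```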
arXiv Detail & Related papers (2024-08-20T17:21:10Z)
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
We present our work on building a novel web agent, Agent-E.
Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents.
We show that Agent-E beats other SOTA text and multi-modal web agents on the WebVoyager benchmark in most categories by 10-30%.
arXiv Detail & Related papers (2024-07-17T21:44:28Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z)
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents
This paper presents an Agent-Constitution-based agent framework, TrustAgent, with a focus on improving LLM-based agent safety.
The proposed framework ensures strict adherence to the Agent Constitution through three strategic components: a pre-planning strategy that injects safety knowledge into the model before plan generation, an in-planning strategy that enhances safety during plan generation, and a post-planning strategy that verifies safety by inspecting the generated plan, as sketched after this entry.
arXiv Detail & Related papers (2024-02-02T17:26:23Z)
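A minimal sketch of how these three stages could compose into one pipeline; `generate_plan`, the rule strings, and the substring-based inspection are all illustrative stand-ins, not TrustAgent's actual interface:

```python
def trustagent_pipeline(generate_plan, safety_rules, forbidden, task):
    """Three-stage safety pipeline in the spirit of TrustAgent (illustrative).

    pre-planning  : inject safety knowledge before plan generation
    in-planning   : keep the rules in context while the plan is generated
    post-planning : inspect the finished plan and regenerate on violation
    """
    # Pre-planning: prepend relevant safety knowledge to the prompt.
    prompt = "Safety rules:\n" + "\n".join(safety_rules) + f"\n\nTask: {task}"

    # In-planning: the generator is conditioned on the rules in its prompt.
    plan = generate_plan(prompt)

    # Post-planning: a toy inspection; flag steps containing forbidden actions.
    violations = [s for s in plan if any(f in s for f in forbidden)]
    if violations:
        prompt += "\nDo not include these steps: " + "; ".join(violations)
        plan = generate_plan(prompt)
    return plan
```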