WebSuite: Systematically Evaluating Why Web Agents Fail
- URL: http://arxiv.org/abs/2406.01623v1
- Date: Sat, 1 Jun 2024 00:32:26 GMT
- Title: WebSuite: Systematically Evaluating Why Web Agents Fail
- Authors: Eric Li, Jim Waldo
- Abstract summary: We describe WebSuite, the first diagnostic benchmark for generalist web agents.
This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart.
We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe WebSuite, the first diagnostic benchmark for generalist web agents, designed to systematically evaluate why agents fail. Advances in AI have led to the rise of numerous web agents that autonomously operate a browser to complete tasks. However, most existing benchmarks focus strictly on measuring whether an agent can or cannot complete a task, without giving insight into why. In this paper, we 1) develop a taxonomy of web actions to facilitate identifying common failure patterns, and 2) create an extensible benchmark suite to assess agents' performance on our taxonomized actions. This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart, and is designed such that any failure of a task can be attributed directly to a failure of a specific web action. We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each. Because WebSuite disaggregates task failures into specific action failures, it enables granular identification of the UX flows an individual agent struggles with and immediately highlights promising avenues for improvement. These findings underscore the need for more focused benchmarking of where web agents go wrong in order to improve agents beyond their weak performance today.
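The abstract's core design principle, that any end-to-end failure must be attributable to exactly one taxonomized web action, can be made concrete with a short sketch. The Python below is not from the WebSuite codebase; the action taxonomy and the `Step`/`run_task` names are hypothetical, illustrating only how per-step post-condition checks let a harness disaggregate an end-to-end failure into a specific action failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class WebAction(Enum):
    """Hypothetical action taxonomy; the paper defines its own taxonomy of web actions."""
    CLICK = "click"
    TYPE = "type"
    SELECT = "select"
    NAVIGATE = "navigate"

@dataclass
class Step:
    action: WebAction            # the taxonomized action this step exercises
    description: str             # e.g. "click the 'Add to cart' button"
    check: Callable[[], bool]    # oracle: did the page reach the expected state?

@dataclass
class TaskResult:
    success: bool
    failed_action: Optional[WebAction] = None  # first action class that failed, if any

def run_task(steps: list[Step]) -> TaskResult:
    """Run an end-to-end task as a sequence of atomic steps, so that any
    failure is attributed to the first step whose post-condition check fails."""
    for step in steps:
        if not step.check():
            return TaskResult(success=False, failed_action=step.action)
    return TaskResult(success=True)

# Example: an "add an item to a cart" task decomposed into atomic actions.
# The lambdas stand in for real page-state checks (e.g. via a browser driver).
add_to_cart = [
    Step(WebAction.NAVIGATE, "open the product page", lambda: True),
    Step(WebAction.CLICK, "click the 'Add to cart' button", lambda: True),
]
print(run_task(add_to_cart))  # TaskResult(success=True, failed_action=None)
```

Aggregating `failed_action` across many tasks yields the kind of per-action failure profile the abstract describes, e.g. revealing that an agent navigates reliably but consistently fails on select elements.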
Related papers
- WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
A benchmark called WASP introduces realistic web agent hijacking objectives and an isolated environment to test them.
Our evaluation shows that even AI agents backed by models with advanced reasoning capabilities are susceptible to low-effort human-written prompt injections.
Agents begin executing the adversarial instruction between 16% and 86% of the time, but achieve the adversarial goal only 0% to 17% of the time.
arXiv Detail & Related papers (2025-04-22T17:51:03Z)
- AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents.
Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks.
We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z)
- An Illusion of Progress? Assessing the Current State of Web Agents
We conduct a comprehensive and rigorous assessment of the current state of web agents.
Results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results.
We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites.
arXiv Detail & Related papers (2025-04-02T05:51:29Z)
- Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents
We introduce Auto-Intent, a method to adapt a pre-trained large language model (LLM) as an agent for a target domain without direct fine-tuning.
Our approach first discovers the underlying intents from target-domain demonstrations in an unsupervised manner.
We then train an intent predictor to predict the next intent given the agent's past observations and actions.
arXiv Detail & Related papers (2024-10-29T21:37:04Z)
- Beyond Browsing: API-Based Web Agents
API-based agents outperform web browsing agents in experiments on WebArena.
Hybrid agents outperform both others nearly uniformly across tasks.
Results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
arXiv Detail & Related papers (2024-10-21T19:46:06Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points, respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement
Gödel Agent is a self-evolving framework inspired by the Gödel machine.
Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
arXiv Detail & Related papers (2024-10-06T10:49:40Z)
- Multimodal Auto Validation For Self-Refinement in Web Agents
This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement.
We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents.
We also introduce a self-refinement mechanism for web automation that uses the developed auto-validator to let web agents detect and self-correct workflow failures.
arXiv Detail & Related papers (2024-10-01T13:43:55Z)
- Dissecting Adversarial Robustness of Multimodal LM Agents
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.
We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search.
We also use our Agent Robustness Evaluation (ARE) framework to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
- Mind2Web: Towards a Generalist Agent for the Web
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.