WAREX: Web Agent Reliability Evaluation on Existing Benchmarks
- URL: http://arxiv.org/abs/2510.03285v1
- Date: Sun, 28 Sep 2025 20:51:05 GMT
- Title: WAREX: Web Agent Reliability Evaluation on Existing Benchmarks
- Authors: Su Kara, Fazle Faisal, Suman Nath
- Abstract summary: We present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side issues, server-side issues, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications, which can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.
Related papers
- Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents [15.381306470663695]
We present a large-scale empirical study of TOCTOU vulnerabilities in browser-use agents. Dynamic or adversarial web content can exploit this window to induce unintended actions. We design a lightweight mitigation based on pre-execution validation.
arXiv Detail & Related papers (2026-02-28T05:25:03Z) - WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents [45.87204751555924]
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness. We propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages.
arXiv Detail & Related papers (2026-02-03T17:55:04Z) - It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents [52.81924177620322]
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z) - BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks [51.803138848305814]
We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks. We identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. Our findings surface both the diversity and brittleness of current web agents.
arXiv Detail & Related papers (2025-10-02T15:22:21Z) - WALT: Web Agents that Learn Tools [66.73502484310121]
WALT is a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning.
arXiv Detail & Related papers (2025-10-01T23:41:47Z) - A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains [23.412858949638263]
Current benchmarks in the e-commerce domain face two major problems. They primarily focus on product search tasks, failing to capture the broader range of functionalities offered by real-world e-commerce platforms. We propose a new benchmark called Amazon-Bench to generate user queries that cover a broad range of tasks.
arXiv Detail & Related papers (2025-08-18T21:58:43Z) - WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks [7.4706262500758385]
We introduce WebArXiv, a benchmark for evaluating autonomous web agents. WebArXiv consists of 275 web-based tasks grounded in the arXiv platform. We propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps.
arXiv Detail & Related papers (2025-07-01T16:43:57Z) - WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [78.55946306325914]
We identify key reasoning skills essential for effective web agents. We reconstruct the agent's reasoning algorithms into chain-of-thought rationales. Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [61.38499597241457]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z) - ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents [3.09793323158304]
Existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. We introduce ST-WebAgentBench, a suite for evaluating web agent safety and trustworthiness (ST) across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six dimensions (e.g., user consent, robustness).
arXiv Detail & Related papers (2024-10-09T09:13:38Z) - WebSuite: Systematically Evaluating Why Web Agents Fail [2.200477647229223]
We describe WebSuite, the first diagnostic benchmark for generalist web agents.
This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart.
We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent.
arXiv Detail & Related papers (2024-06-01T00:32:26Z) - WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.