A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
- URL: http://arxiv.org/abs/2508.15832v1
- Date: Mon, 18 Aug 2025 21:58:43 GMT
- Title: A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
- Authors: Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans,
- Abstract summary: Current benchmarks in the e-commerce domain face two major problems. They primarily focus on product search tasks, failing to capture the broader range of functionalities offered by real-world e-commerce platforms. We propose a new benchmark called Amazon-Bench to generate user queries that cover a broad range of tasks.
- Score: 23.412858949638263
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user's account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, checkboxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.
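The abstract describes two components: a data generation pipeline that grounds user queries in a page's interactive elements, and an automated evaluation framework that scores both task completion and safety. The Python sketch below is purely illustrative and is not the authors' released code; the element names, query template, risk labels, and scoring rule are all hypothetical assumptions meant only to show one way functionality-grounded tasks and joint performance/safety records could be represented.

```python
# Illustrative sketch only. All names, templates, and risk labels below are
# hypothetical and are NOT taken from the Amazon-Bench paper or its code.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Risk(Enum):
    """Coarse risk level of an interactive element if triggered by mistake."""
    READ_ONLY = "read_only"        # e.g., expanding an order's details
    REVERSIBLE = "reversible"      # e.g., adding an item to a wish list
    IRREVERSIBLE = "irreversible"  # e.g., deleting a saved address


@dataclass
class InteractiveElement:
    """A button/checkbox/link scraped from a page, used to ground a query."""
    page: str
    label: str
    risk: Risk


@dataclass
class BenchmarkTask:
    query: str
    grounded_on: InteractiveElement
    # Side effects the agent must not cause while completing the task.
    forbidden_actions: List[str] = field(default_factory=list)


def generate_tasks(elements: List[InteractiveElement]) -> List[BenchmarkTask]:
    """Turn scraped page elements into functionality-grounded user queries."""
    tasks = []
    for el in elements:
        query = f"On the {el.page} page, use the '{el.label}' control to complete the action."
        forbidden = []
        if el.risk is Risk.IRREVERSIBLE:
            # Pair risky functionality with an explicit safety constraint.
            forbidden.append(f"Do not confirm '{el.label}' unless it matches the user's intent.")
        tasks.append(BenchmarkTask(query=query, grounded_on=el, forbidden_actions=forbidden))
    return tasks


@dataclass
class EvaluationResult:
    """Joint performance/safety outcome, mirroring the two axes in the abstract."""
    completed: bool           # did the agent satisfy the user query?
    unsafe_side_effects: int  # unintended changes (wrong purchase, deleted address, ...)

    @property
    def safe_success(self) -> bool:
        return self.completed and self.unsafe_side_effects == 0


if __name__ == "__main__":
    elements = [
        InteractiveElement("address book", "Delete address", Risk.IRREVERSIBLE),
        InteractiveElement("wish list", "Add to list", Risk.REVERSIBLE),
        InteractiveElement("gift cards", "Set up auto-reload", Risk.IRREVERSIBLE),
    ]
    for task in generate_tasks(elements):
        print(task.query, "| constraints:", task.forbidden_actions)
    print(EvaluationResult(completed=True, unsafe_side_effects=1).safe_success)  # False
```

Tying each generated query to the element it was derived from keeps tasks functionality-grounded, and recording unintended side effects separately from completion lets success and safety be reported as distinct axes, in the spirit of the evaluation framework the abstract describes.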
Related papers
- Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct a three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
- DRBench: A Realistic Benchmark for Enterprise Deep Research [81.49694432639406]
DRBench is a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance.
arXiv Detail & Related papers (2025-09-30T18:47:20Z)
- WAREX: Web Agent Reliability Evaluation on Existing Benchmarks [2.3381951994604977]
We present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.
arXiv Detail & Related papers (2025-09-28T20:51:05Z) - Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It achieves superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - DeepShop: A Benchmark for Deep Research Shopping Agents [70.03744154560717]
DeepShop is a benchmark designed to evaluate web agents in complex and realistic online shopping environments. We generate diverse queries across five popular online shopping domains. We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects.
arXiv Detail & Related papers (2025-06-03T13:08:17Z) - REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites [9.58858258192147]
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation.
arXiv Detail & Related papers (2025-04-15T18:22:55Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space. AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents [3.09793323158304]
Existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. We introduce ST-WebAgentBench, a suite for evaluating web agent safety and trustworthiness (ST) across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six dimensions (e.g., user consent, robustness).
arXiv Detail & Related papers (2024-10-09T09:13:38Z)
- Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- WebSuite: Systematically Evaluating Why Web Agents Fail [2.200477647229223]
We describe WebSuite, the first diagnostic benchmark for generalist web agents.
This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart.
We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent.
arXiv Detail & Related papers (2024-06-01T00:32:26Z)