AI Agents for Web Testing: A Case Study in the Wild
- URL: http://arxiv.org/abs/2509.05197v1
- Date: Fri, 05 Sep 2025 15:57:16 GMT
- Title: AI Agents for Web Testing: A Case Study in the Wild
- Authors: Naimeng Ye, Xiao Yu, Ruize Xu, Tianyi Peng, Zhou Yu
- Abstract summary: We present WebProber, a prototype AI agent-based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human-readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues--many of which were missed by traditional tools.
- Score: 20.669140680308494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated web testing plays a critical role in ensuring high-quality user experiences and delivering business value. Traditional approaches primarily focus on code coverage and load testing, but often fall short of capturing complex user behaviors, leaving many usability issues undetected. The emergence of large language models (LLMs) and AI agents opens new possibilities for web testing by enabling human-like interaction with websites and a general awareness of common usability problems. In this work, we present WebProber, a prototype AI agent-based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human-readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues--many of which were missed by traditional tools. Our findings highlight agent-based testing as a promising direction while outlining directions for developing next-generation, user-centered testing frameworks.
Related papers
- Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts [59.68272935616536]
Avenir-Web is a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks.
arXiv Detail & Related papers (2026-02-02T18:50:07Z) - Building the Web for Agents: A Declarative Framework for Agent-Web Interaction [0.7116403133334644]
We introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents. VOIX introduces <tool> and <context> tags, allowing developers to explicitly define available actions and relevant state. We evaluated the framework's practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers.
arXiv Detail & Related papers (2025-11-14T13:23:34Z) - WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code [57.45181837786448]
Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. We propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code.
arXiv Detail & Related papers (2025-06-09T14:46:02Z) - TESTQUEST: A Web Gamification Tool to Improve Locators and Page Objects Quality [2.156170153103442]
TestQUEST is a tool designed to improve test robustness by applying gamification to locators and Page Objects. Locators are highly sensitive to the frequent changes in Web page structures caused by rapid software evolution.
arXiv Detail & Related papers (2025-05-30T16:18:10Z) - WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [78.55946306325914]
We identify key reasoning skills essential for effective web agents. We reconstruct the agent's reasoning algorithms into chain-of-thought rationales. Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z) - AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents [28.20409050985182]
A/B testing remains constrained by its dependence on the large-scale and live traffic of human participants. We present AgentA/B, a novel system that automatically simulates user interaction behaviors with real webpages. Our findings suggest AgentA/B can emulate human-like behavior patterns.
arXiv Detail & Related papers (2025-04-13T21:10:56Z) - WebGames: Challenging General-Purpose Web-Browsing AI Agents [11.320069795732058]
WebGames is a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only a 43.1% success rate compared to human performance of 95.7%.
arXiv Detail & Related papers (2025-02-25T16:45:08Z) - Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead [43.15092098658384]
Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. We explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing).
arXiv Detail & Related papers (2024-12-11T17:57:23Z) - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z) - WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.