Towards Automated Page Object Generation for Web Testing using Large Language Models
- URL: http://arxiv.org/abs/2602.19294v1
- Date: Sun, 22 Feb 2026 18:06:57 GMT
- Title: Towards Automated Page Object Generation for Web Testing using Large Language Models
- Authors: Betül Karagöz, Filippo Ricca, Matteo Biagiola, Andrea Stocco
- Abstract summary: This paper presents an empirical study on the feasibility of using Large Language Models (LLMs) to automatically generate Page Objects (POs) for web testing. Our results show that LLMs can generate syntactically correct and functionally useful POs with accuracy values ranging from 32.6% to 54.0% and an element recognition rate exceeding 70% in most cases.
- Score: 2.451367554740889
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Page Objects (POs) are a widely adopted design pattern for improving the maintainability and scalability of automated end-to-end web tests. However, creating and maintaining POs is still largely a manual, labor-intensive activity, while automated solutions have seen limited practical adoption. In this context, the potential of Large Language Models (LLMs) for these tasks has remained largely unexplored. This paper presents an empirical study on the feasibility of using LLMs, specifically GPT-4o and DeepSeek Coder, to automatically generate POs for web testing. We evaluate the generated artifacts on an existing benchmark of five web applications for which manually written POs are available (the ground truth), focusing on accuracy (i.e., the proportion of ground truth elements correctly identified) and element recognition rate (i.e., the proportion of ground truth elements correctly identified or marked for modification). Our results show that LLMs can generate syntactically correct and functionally useful POs with accuracy values ranging from 32.6% to 54.0% and element recognition rates exceeding 70% in most cases. Our study contributes the first systematic evaluation of the strengths and open challenges of LLMs for automated PO generation, and provides directions for further research on integrating LLMs into practical testing workflows.
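As a minimal illustration of the concepts the abstract relies on (the code is not taken from the paper; the page, element names, and selectors are hypothetical), the sketch below shows a hand-written Page Object in Selenium style, together with the two evaluation metrics exactly as the abstract defines them:

```python
# Illustrative sketch only: LoginPage and its locators are hypothetical,
# not drawn from the paper or its benchmark applications.

class LoginPage:
    """A Page Object hides raw locators behind a stable, intention-revealing
    API, so end-to-end tests survive markup changes by editing one class."""

    # Selenium-style (strategy, value) locator tuples
    USERNAME = ("id", "username")
    PASSWORD = ("id", "password")
    SUBMIT = ("css selector", "button[type='submit']")

    def __init__(self, driver):
        self.driver = driver  # any Selenium-like WebDriver

    def login(self, user: str, password: str) -> None:
        self.driver.find_element(*self.USERNAME).send_keys(user)
        self.driver.find_element(*self.PASSWORD).send_keys(password)
        self.driver.find_element(*self.SUBMIT).click()


# The study's two metrics, as defined in the abstract:
def accuracy(correct: int, total: int) -> float:
    """Proportion of ground-truth elements correctly identified."""
    return correct / total


def element_recognition_rate(correct: int, marked: int, total: int) -> float:
    """Proportion of ground-truth elements correctly identified
    or marked for modification."""
    return (correct + marked) / total
```

For example, a generated PO covering 50 ground-truth elements, of which 27 are correctly identified and a further 9 are marked for modification, would score an accuracy of 54.0% and an element recognition rate of 72%, consistent with the ranges reported above.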
Related papers
- Finetuning LLMs for Automatic Form Interaction on Web-Browser in Selenium Testing Framework [4.53273595732354]
This paper introduces a novel method for training large language models (LLMs) to generate high-quality test cases in Selenium. We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios. Our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics.
arXiv Detail & Related papers (2025-11-19T06:43:21Z)
- MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs on realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
- MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [33.250579401886206]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate the performance of Large Language Models (LLMs) within the Model Context Protocol (MCP) framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR adopts objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z)
- Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
- TMIQ: Quantifying Test and Measurement Domain Intelligence in Large Language Models [0.0]
We introduce the Test and Measurement Intelligence Quotient (TMIQ), a benchmark designed to quantitatively assess Large Language Models (LLMs). TMIQ offers a comprehensive set of scenarios and metrics for detailed evaluation, including SCPI command matching accuracy, ranked response evaluation, and Chain-of-Thought Reasoning (CoT). In testing various LLMs, our findings indicate varying levels of proficiency, with exact SCPI command match accuracy ranging from around 56% to 73%, and ranked matching first-position scores achieving around 33%.
arXiv Detail & Related papers (2025-03-03T23:12:49Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- An efficient approach to represent enterprise web application structure using Large Language Model in the service of Intelligent Quality Engineering [0.0]
This paper presents a novel approach to representing enterprise web application structures using Large Language Models (LLMs). We introduce a hierarchical representation methodology that optimizes the few-shot learning capabilities of LLMs. Our methodology addresses existing challenges in applying Generative AI techniques to automated software testing.
arXiv Detail & Related papers (2025-01-12T15:10:57Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Improving web element localization by using a large language model [6.126394204968227]
Large Language Models (LLMs) can show human-like reasoning abilities on some tasks.
This paper introduces and evaluates VON Similo LLM, an enhanced web element localization approach.
arXiv Detail & Related papers (2023-10-03T13:39:22Z)
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.