Related papers: Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation

Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation

URL: http://arxiv.org/abs/2601.06034v1
Date: Fri, 28 Nov 2025 10:25:12 GMT
Title: Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation
Authors: Dudekula Kasim Vali,
Abstract summary: Large Language Models (LLMs) can generate code that hallucinates non-existent UI elements.<n>We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds script generation in project-specific documentation and HTML structure.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software testing is critical in the software development lifecycle, yet translating requirements into executable test scripts remains manual and error-prone. While Large Language Models (LLMs) can generate code, they often hallucinate non-existent UI elements. We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds Selenium script generation in project-specific documentation and HTML structure. By ingesting diverse formats (Markdown, PDF, HTML) into a vector database, our system retrieves relevant context before generation. Evaluation on 20 e-commerce test scenarios shows our RAG approach achieves 100% (20/20) syntax validity and 90% (18/20, 95% CI: [85%, 95%], p < 0.001) execution success, compared to 30% for standard LLM generation. While our evaluation is limited to a single domain, our method significantly reduces hallucinations by grounding generation in actual DOM structure, demonstrating RAG's potential for automated UI testing.

Related papers

iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation [48.502477318243386]
iScript is a domain-adapted Qwen3-8B model for Innovus Tcl script generation.<n>iScript shows higher pass@k scores than currently state-of-the-art LLMs on average.
arXiv Detail & Related papers (2026-03-04T15:20:35Z)
RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository [52.98970048197381]
RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation.<n>It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified.<n>Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
arXiv Detail & Related papers (2026-01-20T13:19:20Z)
Finetuning LLMs for Automatic Form Interaction on Web-Browser in Selenium Testing Framework [4.53273595732354]
This paper introduces a novel method for training large language models (LLMs) to generate high-quality test cases in Selenium.<n>We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios.<n>Our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics.
arXiv Detail & Related papers (2025-11-19T06:43:21Z)
MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models [10.977990951788422]
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium.<n> MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty.<n>Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%)
arXiv Detail & Related papers (2025-10-05T21:15:11Z)
Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models [0.0]
Generative AI is gaining increasing attention in software engineering.<n>Unit tests constitute the majority of test cases and are often schematic.<n>This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests.
arXiv Detail & Related papers (2025-07-18T11:23:17Z)
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity.<n>We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities.<n>Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code.<n>TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z)
Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.<n>Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.<n> Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
APITestGenie: Automated API Test Generation through Generative AI [2.0716352593701277]
APITestGenie generates executable API test scripts from business requirements and API specifications. In experiments with 10 real-world APIs, the tool generated valid test scripts 57% of the time. Human intervention is recommended to validate or refine generated scripts before integration into CI/CD pipelines.
arXiv Detail & Related papers (2024-09-05T18:02:41Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.<n>With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.<n> Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.<n>Our framework supports multiple devices and can be easily extended to any environment with a Python interface.<n>The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.