Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation
- URL: http://arxiv.org/abs/2601.06034v1
- Date: Fri, 28 Nov 2025 10:25:12 GMT
- Title: Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation
- Authors: Dudekula Kasim Vali,
- Abstract summary: Large Language Models (LLMs) can generate code that hallucinates non-existent UI elements.<n>We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds script generation in project-specific documentation and HTML structure.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software testing is critical in the software development lifecycle, yet translating requirements into executable test scripts remains manual and error-prone. While Large Language Models (LLMs) can generate code, they often hallucinate non-existent UI elements. We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds Selenium script generation in project-specific documentation and HTML structure. By ingesting diverse formats (Markdown, PDF, HTML) into a vector database, our system retrieves relevant context before generation. Evaluation on 20 e-commerce test scenarios shows our RAG approach achieves 100% (20/20) syntax validity and 90% (18/20, 95% CI: [85%, 95%], p < 0.001) execution success, compared to 30% for standard LLM generation. While our evaluation is limited to a single domain, our method significantly reduces hallucinations by grounding generation in actual DOM structure, demonstrating RAG's potential for automated UI testing.
Related papers
- iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation [48.502477318243386]
iScript is a domain-adapted Qwen3-8B model for Innovus Tcl script generation.<n>iScript shows higher pass@k scores than currently state-of-the-art LLMs on average.
arXiv Detail & Related papers (2026-03-04T15:20:35Z) - RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository [52.98970048197381]
RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation.<n>It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified.<n>Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
arXiv Detail & Related papers (2026-01-20T13:19:20Z) - Finetuning LLMs for Automatic Form Interaction on Web-Browser in Selenium Testing Framework [4.53273595732354]
This paper introduces a novel method for training large language models (LLMs) to generate high-quality test cases in Selenium.<n>We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios.<n>Our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics.
arXiv Detail & Related papers (2025-11-19T06:43:21Z) - MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models [10.977990951788422]
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium.<n> MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty.<n>Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%)
arXiv Detail & Related papers (2025-10-05T21:15:11Z) - Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models [0.0]
Generative AI is gaining increasing attention in software engineering.<n>Unit tests constitute the majority of test cases and are often schematic.<n>This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests.
arXiv Detail & Related papers (2025-07-18T11:23:17Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity.<n>We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities.<n>Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z) - TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code.<n>TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z) - Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.<n>Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.<n> Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - APITestGenie: Automated API Test Generation through Generative AI [2.0716352593701277]
APITestGenie generates executable API test scripts from business requirements and API specifications.
In experiments with 10 real-world APIs, the tool generated valid test scripts 57% of the time.
Human intervention is recommended to validate or refine generated scripts before integration into CI/CD pipelines.
arXiv Detail & Related papers (2024-09-05T18:02:41Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.<n>With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.<n> Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.<n>Our framework supports multiple devices and can be easily extended to any environment with a Python interface.<n>The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.