Generating Test Scenarios from NL Requirements using Retrieval-Augmented LLMs: An Industrial Study
- URL: http://arxiv.org/abs/2404.12772v1
- Date: Fri, 19 Apr 2024 10:27:40 GMT
- Title: Generating Test Scenarios from NL Requirements using Retrieval-Augmented LLMs: An Industrial Study
- Authors: Chetan Arora, Tomas Herda, Verena Homm
- Abstract summary: This paper presents an automated approach (RAGTAG) for test scenario generation using Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs).
We evaluate RAGTAG on two industrial projects from Austrian Post with bilingual requirements in German and English.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test scenarios are specific instances of test cases that describe actions to validate a particular software functionality. By outlining the conditions under which the software operates and the expected outcomes, test scenarios ensure that the software functionality is tested in an integrated manner. Test scenarios are crucial for systematically testing an application under various conditions, including edge cases, to identify potential issues and guarantee overall performance and reliability. Specifying test scenarios is tedious and requires a deep understanding of software functionality and the underlying domain. It further demands substantial effort and investment from already time- and budget-constrained requirements engineers and testing teams. This paper presents an automated approach (RAGTAG) for test scenario generation using Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs). RAG allows the integration of specific domain knowledge with LLMs' generation capabilities. We evaluate RAGTAG on two industrial projects from Austrian Post with bilingual requirements in German and English. Our results from an interview survey conducted with four experts on five dimensions -- relevance, coverage, correctness, coherence and feasibility -- affirm the potential of RAGTAG in automating test scenario generation. Specifically, our results indicate that, despite the difficult task of analyzing bilingual requirements, RAGTAG is able to produce scenarios that are well-aligned with the underlying requirements and provide coverage of different aspects of the intended functionality. The generated scenarios are easily understandable to experts and feasible for testing in the project environment. The overall correctness is deemed satisfactory; however, gaps in capturing exact action sequences and domain nuances remain, underscoring the need for domain expertise when applying LLMs.
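The paper does not include an implementation, but the retrieve-then-generate pattern behind RAGTAG is straightforward to sketch: retrieve the domain material most relevant to a given requirement, then pass it to an LLM as grounding context for scenario generation. The following is a minimal, dependency-free Python sketch of that pattern, assuming a hypothetical toy corpus; all identifiers are illustrative, and the bag-of-words retriever and printed prompt stand in for the embedding-based retrieval and actual LLM call a production pipeline would use.

```python
# Minimal sketch of the retrieve-then-generate pattern used by RAG
# pipelines: rank domain snippets by similarity to the requirement,
# then build a grounded prompt for an LLM. Illustrative only.
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Bag-of-words stand-in for embedding-based retrieval.
    q = Counter(tokenize(query))
    return sorted(corpus, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)[:k]

def build_prompt(requirement: str, context: list[str]) -> str:
    ctx = "\n".join(f"- {c}" for c in context)
    return ("Using the domain context below, generate test scenarios "
            "(preconditions, actions, expected outcome) for the requirement.\n"
            f"Context:\n{ctx}\nRequirement: {requirement}")

# Toy bilingual corpus (German/English), echoing the paper's setting.
corpus = [
    "Parcels above the maximum weight must be rejected at drop-off.",
    "Sendungsverfolgung: tracking events must be visible to the customer.",
    "Customers can redirect a parcel before the first delivery attempt.",
]
requirement = "The system shall let customers redirect parcels."
print(build_prompt(requirement, retrieve("redirect parcel", corpus)))
# In practice, the printed prompt would be sent to an LLM.
```

In a real pipeline the retrieved context would come from the project's own requirements and domain documents, which is what lets the LLM produce scenarios aligned with domain nuances rather than generic test ideas.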
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- The Potential of LLMs in Automating Software Testing: From Generation to Reporting [0.0]
Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods.
Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering.
This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency.
arXiv Detail & Related papers (2024-12-31T02:06:46Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- Practical, Automated Scenario-based Mobile App Testing [13.52057950260007]
Test scripts developed by human testers consider business logic by focusing on testing scenarios.
Because mobile apps are GUI-intensive, human testers rely on their understanding of the app GUI to organize test scripts around scenarios.
ScenTest approaches automated testing by imitating these human practices and integrating domain knowledge into scenario-based mobile app testing.
arXiv Detail & Related papers (2024-06-12T15:48:39Z)
- A Tool for Test Case Scenarios Generation Using Large Language Models [3.9422957660677476]
This article centers on user requirements presented as epics and high-level user stories.
It introduces a web-based software tool that employs an LLM-based agent and prompt engineering to automate the generation of test case scenarios.
arXiv Detail & Related papers (2024-06-11T07:26:13Z)
- Selene: Pioneering Automated Proof in Software Verification [62.09555413263788]
We introduce Selene, the first project-level automated proof benchmark, constructed from seL4, a real-world industrial-grade operating system microkernel.
Our experimental results with advanced large language models (LLMs), such as GPT-3.5-turbo and GPT-4, highlight the capabilities of LLMs in the domain of automated proof generation.
arXiv Detail & Related papers (2024-01-15T13:08:38Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a large set of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities [8.504639288314063]
Test script generation is a vital component of software testing, enabling efficient and reliable automation of repetitive test tasks.
Existing generation approaches often encounter limitations, such as difficulties in accurately capturing and reproducing test scripts across diverse devices, platforms, and applications.
This paper investigates the application of large language models (LLM) in the domain of mobile application test script generation.
arXiv Detail & Related papers (2023-09-24T07:58:57Z)
- Prioritized Variable-length Test Cases Generation for Finite State Machines [0.09786690381850353]
Model-based Testing (MBT) is an effective approach for testing when parts of a system-under-test have the characteristics of a finite state machine (FSM).
This paper presents a test generation strategy that satisfies these requirements by producing prioritized, variable-length test cases.
Depending on the application of the FSM, the strategy and evaluation presented in this paper are applicable both in testing functional and non-functional software requirements.
arXiv Detail & Related papers (2022-03-17T20:16:45Z)
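The one-line summaries above leave the generation strategy abstract; the core model-based-testing idea is to derive input sequences from the FSM that cover its transitions while honoring priorities. Below is a minimal Python sketch of that idea under strong simplifying assumptions: a hypothetical toy FSM, integer priorities, and one shortest test per transition. The paper's actual strategy produces variable-length test cases and handles prioritization more carefully.

```python
# Minimal sketch of FSM-based test generation: for every transition,
# emit the shortest input sequence from the initial state that
# exercises it, visiting higher-priority transitions first.
from collections import deque

# Toy FSM as {state: [(input, next_state, priority)]}; purely illustrative.
fsm = {
    "created":    [("accept", "in_transit", 2), ("reject", "returned", 1)],
    "in_transit": [("deliver", "delivered", 3), ("fail", "returned", 1)],
    "delivered":  [],
    "returned":   [],
}

def shortest_path(start: str, target: str) -> list[str]:
    """BFS for the input sequence leading from start to the target state."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for inp, nxt, _ in fsm[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [inp]))
    raise ValueError(f"state {target!r} is unreachable")

def generate_tests(initial: str = "created") -> list[list[str]]:
    transitions = [(s, i, p) for s, outs in fsm.items() for i, _, p in outs]
    # One test per transition, high-priority transitions first.
    return [shortest_path(initial, src) + [inp]
            for src, inp, _ in sorted(transitions, key=lambda t: -t[2])]

for test in generate_tests():
    print(" -> ".join(test))
```

A real strategy would also merge tests that share prefixes and bound total length, which is where the variable-length aspect of the paper comes in.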