LLMCFG-TGen: Using LLM-Generated Control Flow Graphs to Automatically Create Test Cases from Use Cases
- URL: http://arxiv.org/abs/2512.06401v1
- Date: Sat, 06 Dec 2025 11:19:37 GMT
- Title: LLMCFG-TGen: Using LLM-Generated Control Flow Graphs to Automatically Create Test Cases from Use Cases
- Authors: Zhenzhen Yang, Chenhui Cui, Tao Li, Rubing Huang, Nan Niu, Dave Towey, Shikai Guo
- Abstract summary: Appropriate test case generation is critical in software testing. Use-case descriptions are a popular method for capturing functional behaviors and interaction flows in a structured form. We propose a new approach that automatically generates test cases from NL use-case descriptions.
- Score: 11.173694789846435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Appropriate test case generation is critical in software testing, significantly impacting the quality of testing. Requirements-Based Test Generation (RBTG) derives test cases from software requirements, aiming to verify whether the system's behaviors align with user needs and expectations. Requirements are often documented in Natural Language (NL), with use-case descriptions being a popular method for capturing functional behaviors and interaction flows in a structured form. Large Language Models (LLMs) have shown strong potential for automating test generation directly from NL requirements. However, current LLM-based approaches may not provide comprehensive, non-redundant coverage. They may also fail to capture complex conditional logic in requirements, resulting in incomplete test cases. We propose a new approach that automatically generates test cases from NL use-case descriptions, called Test Generation based on LLM-generated Control Flow Graphs (LLMCFG-TGen). LLMCFG-TGen comprises three main steps: (1) an LLM transforms a use case into a structured CFG that encapsulates all potential branches; (2) the generated CFG is explored, and all complete execution paths are enumerated; and (3) the execution paths are then used to generate the test cases. To evaluate our proposed approach, we conducted a series of experiments. The results show that LLMs can effectively construct well-structured CFGs from NL use cases. Compared with the baseline methods, LLMCFG-TGen achieves full path coverage, improving completeness and ensuring clear and accurate test cases. Practitioner assessments confirm that LLMCFG-TGen produces logically consistent and comprehensive test cases, while substantially reducing manual effort. The findings suggest that coupling LLM-based semantic reasoning with structured modeling effectively bridges the gap between NL requirements and systematic test generation.
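To make the pipeline concrete, here is a minimal Python sketch of steps (2) and (3): given a control flow graph (hand-written below for a small "withdraw cash" flow, standing in for an LLM-generated CFG), it enumerates all complete entry-to-exit execution paths by depth-first search and turns each path into a test-case skeleton. The data structures and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class CFG:
    """Control flow graph as an adjacency list; edges carry branch-condition labels."""
    entry: str
    exit: str
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str, label: str = "") -> None:
        self.edges.setdefault(src, []).append((dst, label))

def enumerate_paths(cfg: CFG) -> list[list[tuple[str, str]]]:
    """Depth-first enumeration of all entry-to-exit paths (assumes an acyclic CFG)."""
    paths: list[list[tuple[str, str]]] = []

    def dfs(node: str, path: list[tuple[str, str]]) -> None:
        if node == cfg.exit:
            paths.append(path)
            return
        for nxt, label in cfg.edges.get(node, []):
            dfs(nxt, path + [(nxt, label)])

    dfs(cfg.entry, [(cfg.entry, "")])
    return paths

def path_to_test_case(path: list[tuple[str, str]], idx: int) -> dict:
    """Turn one execution path into a test-case skeleton (ordered steps plus branch conditions)."""
    return {
        "id": f"TC-{idx:02d}",
        "steps": [node for node, _ in path],
        "conditions": [label for _, label in path if label],
    }

if __name__ == "__main__":
    # Toy CFG for a "withdraw cash" use case (stand-in for an LLM-generated graph).
    g = CFG(entry="insert_card", exit="end")
    g.add_edge("insert_card", "enter_pin")
    g.add_edge("enter_pin", "select_amount", label="PIN valid")
    g.add_edge("enter_pin", "end", label="PIN invalid")
    g.add_edge("select_amount", "dispense_cash", label="balance sufficient")
    g.add_edge("select_amount", "end", label="balance insufficient")
    g.add_edge("dispense_cash", "end")

    for i, p in enumerate(enumerate_paths(g), start=1):
        print(path_to_test_case(p, i))
```

Each printed skeleton corresponds to one complete execution path; covering every such path is the "full path coverage" property the abstract attributes to LLMCFG-TGen.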
Related papers
- On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language [4.290931412096985]
The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising direction. Recent advances in large language models (LLMs) have opened the possibility of direct execution of NL test cases by LLM agents. This paper investigates NL test case unsoundness and test case execution consistency.
arXiv Detail & Related papers (2025-09-23T15:20:40Z)
- Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
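A minimal sketch of what such a layered decomposition could look like in code, assuming a simple responsibility split between the three named layers; the interfaces, method names, and the stub core are illustrative guesses, not the paper's definitions.

```python
from typing import Callable, Protocol

class LLMInferenceCore(Protocol):
    """Innermost layer: raw, non-deterministic model inference."""
    def generate(self, prompt: str) -> str: ...

class PromptOrchestrationLayer:
    """Middle layer: assembles prompts and tracks multi-turn context."""
    def __init__(self, core: LLMInferenceCore) -> None:
        self.core = core
        self.history: list[str] = []

    def ask(self, user_input: str) -> str:
        prompt = "\n".join(self.history + [user_input])
        reply = self.core.generate(prompt)
        self.history += [user_input, reply]
        return reply

class SystemShellLayer:
    """Outer layer: retrieval augmentation and tool/system integration around the orchestrator."""
    def __init__(self, orchestrator: PromptOrchestrationLayer,
                 retrieve: Callable[[str], str]) -> None:
        self.orchestrator = orchestrator
        self.retrieve = retrieve

    def handle(self, user_input: str) -> str:
        context = self.retrieve(user_input)
        return self.orchestrator.ask(f"{context}\n{user_input}")

class EchoCore:
    """Stub core for illustration; a real system would call an LLM here."""
    def generate(self, prompt: str) -> str:
        return f"[reply to: {prompt.splitlines()[-1]}]"

shell = SystemShellLayer(PromptOrchestrationLayer(EchoCore()),
                         retrieve=lambda q: f"[retrieved docs about: {q}]")
print(shell.handle("How do I reset my password?"))
```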
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- TestWeaver: Execution-aware, Feedback-driven Regression Testing Generation with Large Language Models [5.871736617580623]
Regression testing ensures that code changes do not unintentionally break existing functionality. Recent advances in large language models (LLMs) have shown promise in automating test generation for regression testing. We present TestWeaver, a novel approach that integrates lightweight program analysis to guide test generation more effectively.
arXiv Detail & Related papers (2025-08-02T08:13:02Z)
- Self-Steering Language Models [113.96916935955842]
DisCIPL is a method for "self-steering" language models (LMs). DisCIPL generates a task-specific inference program that is executed by a population of Follower models. Our work opens up a design space of highly-parallelized Monte Carlo inference strategies.
arXiv Detail & Related papers (2025-04-09T17:54:22Z)
- A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability [0.8287206589886879]
We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models (LLMs). By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
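As a rough illustration of composing basic control-flow structures into programs of graded complexity, the snippet below generates small Python functions with nested if/else branches; greater nesting depth yields more paths for a model's tests to cover. This is a hypothetical sketch of the general idea, not the GBCV framework itself.

```python
import random

def indent(code: str) -> str:
    return "".join("    " + line + "\n" for line in code.splitlines())

def gen_condition(var: str) -> str:
    """A simple boundary-style comparison over one variable."""
    return f"{var} {random.choice(['<', '<=', '>', '>=', '=='])} {random.randint(-10, 10)}"

def gen_block(depth: int, var: str = "x") -> str:
    """Compose nested if/else structures; depth 0 bottoms out in a plain assignment."""
    if depth == 0:
        return f"{var} = {var} + {random.randint(1, 5)}\n"
    then_body = indent(gen_block(depth - 1, var))
    else_body = indent(gen_block(depth - 1, var))
    return f"if {gen_condition(var)}:\n{then_body}else:\n{else_body}"

def gen_function(depth: int) -> str:
    """Wrap a generated block in a function; deeper nesting means more branches to test."""
    return "def f(x):\n" + indent(gen_block(depth) + "return x\n")

print(gen_function(depth=2))  # emit one benchmark program; vary depth to scale complexity
```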
arXiv Detail & Related papers (2025-02-05T03:51:44Z)
- Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy [0.5057850174013127]
Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code. This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs. Experiments show that VALTEST boosts test validity by up to 29% and improves code generation performance, as evidenced by significant increases in pass@1 scores.
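Semantic entropy is usually estimated by sampling several model outputs for the same input, grouping samples that are semantically equivalent, and measuring the entropy of the resulting cluster distribution; low entropy indicates the model is consistent, which can serve as a validity signal for a generated test. The snippet below is a rough, self-contained sketch of that general idea, not VALTEST's actual implementation; the equivalence predicate and toy samples are placeholders.

```python
import math
from typing import Callable

def semantic_entropy(samples: list[str],
                     equivalent: Callable[[str, str], bool]) -> float:
    """Cluster samples by semantic equivalence, then compute entropy over cluster sizes."""
    clusters: list[list[str]] = []
    for s in samples:
        for cluster in clusters:
            if equivalent(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Toy usage: several sampled "expected outputs" for the same test input.
samples = ["42", "42", "forty-two", "42", "41"]
normalize = lambda s: s.replace("forty-two", "42")
h = semantic_entropy(samples, lambda a, b: normalize(a) == normalize(b))
print(f"entropy = {h:.3f}")  # lower entropy -> more consistent, likelier-valid test case
```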
arXiv Detail & Related papers (2024-11-13T00:07:32Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
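To make that workflow concrete, the sketch below encodes a toy arithmetic reasoning graph and perturbs it by appending extra reasoning steps, yielding a harder variant whose ground-truth answer can still be recomputed. It is an illustrative approximation of the described idea, not DARG's actual extraction or perturbation procedure; the data structures and operations are assumptions.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ReasoningGraph:
    """Toy arithmetic reasoning graph: leaves hold values, steps combine them into results."""
    values: dict[str, int]
    steps: list[tuple[str, str, str, str]] = field(default_factory=list)  # (out, op, lhs, rhs)

    def answer(self) -> int:
        env = dict(self.values)
        for out, op, lhs, rhs in self.steps:
            env[out] = env[lhs] + env[rhs] if op == "+" else env[lhs] * env[rhs]
        return env[self.steps[-1][0]] if self.steps else 0

def perturb(g: ReasoningGraph, extra_steps: int = 1) -> ReasoningGraph:
    """Raise complexity in a controlled way by chaining extra steps onto the final result."""
    values, steps = dict(g.values), list(g.steps)
    for _ in range(extra_steps):
        leaf = f"v{len(values)}"
        values[leaf] = random.randint(2, 9)
        prev = steps[-1][0] if steps else next(iter(values))
        steps.append((f"r{len(steps)}", random.choice("+*"), prev, leaf))
    return ReasoningGraph(values, steps)

# Toy usage: a graph extracted from "2 + 3", then a harder perturbed variant.
base = ReasoningGraph(values={"a": 2, "b": 3}, steps=[("r0", "+", "a", "b")])
harder = perturb(base, extra_steps=2)
print(base.answer(), harder.answer())
```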
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Adaptable Logical Control for Large Language Models [68.27725600175013]
Ctrl-G is an adaptable framework that facilitates tractable and flexible control of model generation at inference time.
We show that Ctrl-G, when applied to a TULU2-7B model, outperforms GPT3.5 and GPT4 on the task of interactive text editing.
arXiv Detail & Related papers (2024-06-19T23:47:59Z)
- Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of the in-context exemplars included in the prompt greatly impacts performance.
Existing methods fail to adequately account for the impact of exemplar ordering on performance.
arXiv Detail & Related papers (2024-05-25T08:23:05Z)
- Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z)