Directed Grammar-Based Test Generation
- URL: http://arxiv.org/abs/2508.01472v1
- Date: Sat, 02 Aug 2025 19:43:15 GMT
- Title: Directed Grammar-Based Test Generation
- Authors: Lukas Kirschner, Ezekiel Soremekun
- Abstract summary: This work proposes an automated test generation approach (called FdLoop). FdLoop iteratively learns relevant input properties from existing inputs to drive the generation of goal-specific inputs. We evaluate FdLoop using three well-known input formats (JSON, CSS and JavaScript) and 20 open-source software projects.
- Score: 2.0948216657769616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To effectively test complex software, it is important to generate goal-specific inputs, i.e., inputs that achieve a specific testing goal. However, most state-of-the-art test generators are not designed to target specific goals. Notably, grammar-based test generators, which (randomly) produce syntactically valid inputs from an input specification (i.e., a grammar), have a low probability of achieving an arbitrary testing goal. This work addresses this challenge by proposing an automated test generation approach (called FdLoop) which iteratively learns relevant input properties from existing inputs to drive the generation of goal-specific inputs. Given a testing goal, FdLoop iteratively selects, evolves, and learns the input distribution of goal-specific test inputs via test feedback and a probabilistic grammar. We concretize FdLoop for four testing goals, namely unique code coverage, input-to-code complexity, program failures (exceptions), and long execution time. We evaluate FdLoop using three well-known input formats (JSON, CSS and JavaScript) and 20 open-source software projects. In most (86%) settings, FdLoop outperforms all five tested baselines, namely the baseline grammar-based test generators (random, probabilistic and inverse-probabilistic methods), EvoGFuzz and DynaMosa. FdLoop is up to twice (2X) as effective as the best baseline (EvoGFuzz) in inducing erroneous behaviors. In addition, we show that the main components of FdLoop (i.e., the input mutator, grammar mutator and test feedback) contribute positively to its effectiveness. Finally, our evaluation demonstrates that FdLoop effectively achieves single testing goals (revealing erroneous behaviors, generating complex inputs, or inducing long execution time) and scales to multiple testing goals across varying parameter settings.
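The loop the abstract describes (generate inputs from a probabilistic grammar, score them against a testing goal via test feedback, select the best, and re-learn production weights) can be sketched roughly as follows. This is a minimal illustration only: the toy JSON grammar, the nesting-depth fitness function, and all helper names are assumptions for exposition, not FdLoop's actual implementation or API.

```python
import random

# Hypothetical toy grammar for a JSON fragment: nonterminal -> list of
# (expansion, weight) pairs. Terminals are plain strings.
GRAMMAR = {
    "<value>": [(["<object>"], 1.0), (["<number>"], 2.0), (['"s"'], 2.0)],
    "<object>": [(['{"k": ', "<value>", "}"], 1.0)],
    "<number>": [(["1"], 1.0), (["3.14"], 1.0)],
}

def generate(symbol="<value>", grammar=GRAMMAR, depth=0):
    """Expand `symbol` by sampling alternatives according to their weights."""
    if symbol not in grammar:
        return symbol
    alts = grammar[symbol]
    if depth > 6:  # prefer terminal-only alternatives so expansion terminates
        alts = [a for a in alts if all(s not in grammar for s in a[0])] or alts
    expansion = random.choices([a[0] for a in alts], [a[1] for a in alts])[0]
    return "".join(generate(s, grammar, depth + 1) for s in expansion)

def fitness(inp):
    """Goal-specific test feedback; here the goal is input complexity,
    crudely measured as JSON nesting depth."""
    return inp.count("{")

def learn_weights(grammar, selected):
    """Shift production weights toward terminals that occur often in the
    selected (goal-specific) inputs -- a crude probabilistic-grammar update."""
    updated = {}
    for nonterminal, alternatives in grammar.items():
        updated[nonterminal] = []
        for expansion, weight in alternatives:
            terminals = [t for t in expansion if t not in grammar]
            bonus = sum(inp.count(t) for inp in selected for t in terminals)
            updated[nonterminal].append((expansion, weight + bonus))
    return updated

grammar = GRAMMAR
for round_no in range(5):
    population = [generate(grammar=grammar) for _ in range(50)]   # generate
    best = sorted(population, key=fitness, reverse=True)[:10]     # select
    grammar = learn_weights(grammar, best)                        # learn/evolve
    print(round_no, "best fitness:", fitness(best[0]), "example:", best[0])
```

Under this sketch, rounds gradually bias the grammar toward more deeply nested objects because the selected inputs reward the object production; a different testing goal would simply swap in a different feedback function.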
Related papers
- LLM-based Unit Test Generation for Dynamically-Typed Programs [16.38145000434927]
TypeTest is a novel framework that enhances type correctness in test generation through a vector-based Retrieval-Augmented Generation system. In an evaluation on 125 real-world Python modules, TypeTest achieved an average statement coverage of 86.6% and branch coverage of 76.8%, outperforming state-of-the-art tools by 5.4% and 9.3%, respectively.
arXiv Detail & Related papers (2025-03-18T08:07:17Z) - Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs). We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method is able to work under gray-box conditions without access to model training data or weights. We evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - PROZE: Generating Parameterized Unit Tests Informed by Runtime Data [10.405775369526006]
A parameterized unit test (PUT) receives a set of inputs as arguments and contains assertions that are expected to hold true for all these inputs.
In this paper, we address the problem of finding oracles for PUTs that hold over multiple inputs.
We design a system called PROZE that generates PUTs by identifying developer-written assertions that are valid for more than one test input.
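For illustration, a parameterized unit test separates the test logic and its oracle from the concrete inputs it runs over. A minimal Python sketch using pytest is shown below; the function under test and the inputs are invented, and PROZE itself extracts such oracles from developer-written assertions rather than from this toy example.

```python
import pytest

def normalize(path: str) -> str:
    """Toy function under test (illustrative only): collapse empty segments."""
    return "/" + "/".join(p for p in path.split("/") if p)

# A parameterized unit test: one assertion, checked against every input.
# The oracle (normalizing twice equals normalizing once) must hold for all
# of the listed inputs, not just a single hard-coded one.
@pytest.mark.parametrize("path", ["/a//b", "a/b/", "//", "/a/b/c"])
def test_normalize_is_idempotent(path):
    assert normalize(normalize(path)) == normalize(path)
```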
arXiv Detail & Related papers (2024-06-30T17:07:12Z) - LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs [37.48856389469826]
TrickCatcher generates test cases for uncovering bugs in plausible programs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80x, 2.65x, and 1.66x those of the state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-16T06:20:06Z) - Test-Driven Development for Code Generation [0.850206009406913]
Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements.
This paper investigates if and how Test-Driven Development (TDD) can be incorporated into AI-assisted code-generation processes.
arXiv Detail & Related papers (2024-02-21T04:10:12Z) - Test Generation Strategies for Building Failure Models and Explaining Spurious Failures [4.995172162560306]
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic.
We propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures.
We show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%.
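As a rough illustration of what an interpretable failure model over test-input properties could look like, the sketch below fits a shallow decision tree to made-up input features; the features, data, and library choice (scikit-learn) are assumptions for exposition, not the paper's surrogate-assisted approach.

```python
# Illustrative only: learn interpretable rules that separate "spurious"
# failures (caused by invalid/unrealistic inputs) from genuine ones.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [input_length, num_special_chars, required_field_missing];
# label 1 = spurious failure, 0 = genuine failure. All values are made up.
X = [[5, 0, 1], [120, 3, 0], [7, 9, 1], [300, 1, 0], [2, 8, 1], [250, 0, 0]]
y = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Print the learned rules as human-readable if/else conditions.
print(export_text(model, feature_names=["length", "special_chars", "field_missing"]))
```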
arXiv Detail & Related papers (2023-12-09T18:36:15Z) - Generative Input: Towards Next-Generation Input Methods Paradigm [49.98958865125018]
We propose a novel Generative Input paradigm named GeneInput.
It uses prompts to handle all input scenarios and other intelligent auxiliary input functions, optimizing the model with user feedback to deliver personalized results.
The results demonstrate that we have achieved state-of-the-art performance for the first time in the Full-mode Key-sequence to Characters (FK2C) task.
arXiv Detail & Related papers (2023-11-02T12:01:29Z) - LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
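A rough sketch of assembling one such fine-tuning example by concatenating the instruction, the generated program, and the textual feedback is shown below; the section markers and field layout are invented for illustration and are not LETI's actual data format.

```python
def build_training_example(instruction: str, program: str, feedback: str) -> str:
    """Concatenate instruction, LM-generated program, and textual feedback
    into one fine-tuning sequence (separators are hypothetical)."""
    return (
        f"### Instruction:\n{instruction}\n"
        f"### Program:\n{program}\n"
        f"### Feedback:\n{feedback}\n"
    )

example = build_training_example(
    "Write a function that returns the nth Fibonacci number.",
    "def fib(n):\n    return fib(n - 1) + fib(n - 2)",
    "RecursionError: missing base case for n <= 1.",
)
print(example)
```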
arXiv Detail & Related papers (2023-05-17T15:53:31Z) - Intergenerational Test Generation for Natural Language Processing Applications [16.63835131985415]
We propose an automated test generation method for detecting erroneous behaviors of various NLP applications.
We implement this method into NLPLego, which is designed to fully exploit the potential of seed sentences.
NLPLego successfully detects 1,732, 5,301, and 261,879 incorrect behaviors with around 95.7% precision in three tasks.
arXiv Detail & Related papers (2023-02-21T07:57:59Z) - READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and we find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - Synthetic Datasets for Neural Program Synthesis [66.20924952964117]
We propose a new methodology for controlling and evaluating the bias of synthetic data distributions over both programs and specifications.
We demonstrate, using the Karel DSL and a small Calculator DSL, that training deep networks on these distributions leads to improved cross-distribution generalization performance.
arXiv Detail & Related papers (2019-12-27T21:28:10Z)