Related papers: Teralizer: Semantics-Based Test Generalization from Conventional Unit Tests to Property-Based Tests

URL: http://arxiv.org/abs/2512.14475v1
Date: Tue, 16 Dec 2025 15:08:00 GMT
Title: Teralizer: Semantics-Based Test Generalization from Conventional Unit Tests to Property-Based Tests
Authors: Johann Glock, Clemens Bauer, Martin Pinzger
Abstract summary: Teralizer is a prototype for Java that transforms JUnit tests into property-based jqwik tests. Generalization of mature developer-written tests from Apache Commons utilities showed only 0.05-0.07 percentage points improvement.
Score: 5.266171160963615
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Conventional unit tests validate single input-output pairs, leaving most inputs of an execution path untested. Property-based testing addresses this shortcoming by generating multiple inputs satisfying properties but requires significant manual effort to define properties and their constraints. We propose a semantics-based approach that automatically transforms unit tests into property-based tests by extracting specifications from implementations via single-path symbolic analysis. We demonstrate this approach through Teralizer, a prototype for Java that transforms JUnit tests into property-based jqwik tests. Unlike prior work that generalizes from input-output examples, Teralizer derives specifications from program semantics. We evaluated Teralizer on three progressively challenging datasets. On EvoSuite-generated tests for EqBench and Apache Commons utilities, Teralizer improved mutation scores by 1-4 percentage points. Generalization of mature developer-written tests from Apache Commons utilities showed only 0.05-0.07 percentage points improvement. Analysis of 632 real-world Java projects from RepoReapers highlights applicability barriers: only 1.7% of projects completed the generalization pipeline, with failures primarily due to type support limitations in symbolic analysis and static analysis limitations in our prototype. Based on the results, we provide a roadmap for future work, identifying research and engineering challenges that need to be tackled to advance the field of test generalization. Artifacts available at: https://doi.org/10.5281/zenodo.17950381
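To make the contrast in the abstract concrete, the following is a minimal sketch of what generalizing a single-input test along one execution path could look like. It is not Teralizer's output: the `clamp` method, the class name, the value ranges, and the hand-written path condition (`value < low`) are all hypothetical and chosen only for illustration. In jqwik, such a test would typically be expressed with `@Property` and `@ForAll`; the sketch uses only the Java standard library so that it runs without dependencies.

```java
import java.util.Random;

// Hypothetical illustration; none of these names come from the paper.
class GeneralizationSketch {

    // Hypothetical method under test.
    static int clamp(int value, int low, int high) {
        if (value < low) {
            return low;   // path exercised by the tests below
        }
        if (value > high) {
            return high;
        }
        return value;
    }

    // Conventional unit test: checks a single input-output pair
    // on the path where value < low.
    static boolean conventionalTest() {
        return clamp(-5, 0, 10) == 0;
    }

    // Generalized test: samples many inputs that satisfy the same path
    // condition (value < low) and checks the output expected on that
    // path (result == low). The ranges are assumptions that keep the
    // arithmetic free of integer overflow.
    static boolean generalizedTest(long seed, int tries) {
        Random random = new Random(seed);
        for (int i = 0; i < tries; i++) {
            int low = random.nextInt(2001) - 1000;      // low in [-1000, 1000]
            int high = low + random.nextInt(1000);      // high >= low
            int value = low - 1 - random.nextInt(1000); // value < low
            if (clamp(value, low, high) != low) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("conventional: " + conventionalTest());
        System.out.println("generalized: " + generalizedTest(42L, 1000));
    }
}
```

The conventional test covers one point of the `value < low` path, while the generalized variant checks the same expectation for every sampled input on that path. In the approach described above, the path condition and expected output would be extracted by symbolic analysis rather than written by hand.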

Related papers

DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern [6.901203999358967]
We present DiffTester, an acceleration framework specifically tailored for dLLMs in Unit Test Generation (UTG). DiffTester adaptively increases the number of tokens produced at each step without compromising the quality of the output. We extend the original TestEval benchmark, which was limited to Python, by introducing additional programming languages including Java and C++.
arXiv Detail & Related papers (2025-09-29T16:04:18Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Use Property-Based Testing to Bridge LLM Code Generation and Validation [38.25155484701058]
Large Language Models (LLMs) excel at code generation, but ensuring that their outputs are functionally correct is a persistent challenge. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle.
arXiv Detail & Related papers (2025-06-23T06:01:12Z)
JustinANN: Realistic Test Generation for Java Programs Driven by Annotations [8.620106576663622]
We propose JustinANN, a flexible and scalable tool to generate test cases for Java programs. Our approach makes it easier to generate test data inside, on, and outside the boundaries of the requirement domain.
arXiv Detail & Related papers (2025-05-09T01:31:46Z)
TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code. TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z)
Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs). We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs [37.48856389469826]
TrickCatcher generates test cases for uncovering bugs in plausible programs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80x, 2.65x, and 1.66x those of the state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-16T06:20:06Z)
Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
Enriching Automatic Test Case Generation by Extracting Relevant Test Inputs from Bug Reports [10.587260348588064]
We introduce BRMiner, a novel approach that leverages Large Language Models (LLMs) in combination with traditional techniques to extract relevant inputs from bug reports. In this study, we evaluate BRMiner using the Defects4J benchmark and test generation tools such as EvoSuite and Randoop. Our results demonstrate that BRMiner achieves a Relevant Input Rate (RIR) of 60.03% and a Relevant Input Extraction Accuracy Rate (RIEAR) of 31.71%.
arXiv Detail & Related papers (2023-12-22T18:19:33Z)
Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [64.62570402941387]
We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain. Our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe.
arXiv Detail & Related papers (2023-11-02T17:59:32Z)
Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood. Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings. In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers [10.846226514357866]
Unit testing represents the foundational basis of the software testing pyramid. We present an approach to support developers in writing unit test cases by generating accurate and useful assert statements.
arXiv Detail & Related papers (2020-09-11T19:35:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.