Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMs
- URL: http://arxiv.org/abs/2503.16144v1
- Date: Thu, 20 Mar 2025 13:47:06 GMT
- Title: Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMs
- Authors: Djamel Eddine Khelladi, Charly Reux, Mathieu Acher,
- Abstract summary: PolyTest is a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality.
- Score: 5.366354612549172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM)-based test generation has gained attention in software engineering, yet most studies evaluate LLMs' ability to generate unit tests in a single attempt for a given language, missing the opportunity to leverage LLM diversity for more robust testing. This paper introduces PolyTest, a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest systematically leverages these properties in two complementary ways: (1) Cross-lingual test generation, where tests are generated in multiple languages at zero temperature and then unified; (2) Diverse test sampling, where multiple test sets are generated within the same language at a higher temperature before unification. A key insight is that LLMs can generate diverse yet contradicting tests -- same input, different expected outputs -- across languages and generations. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality. Unlike single-language or single-attempt approaches, PolyTest enhances testing without requiring on-the-fly execution, making it particularly beneficial for weaker-performing languages. We evaluate PolyTest on Llama3-70B, GPT-4o, and GPT-3.5 using EvalPlus, generating tests in five languages (Java, C, Python, JavaScript, and a CSV-based format) at temperature 0 and sampling multiple sets at temperature 1. We observe that LLMs frequently generate contradicting tests across settings, and that PolyTest significantly improves test quality across all considered metrics -- number of tests, passing rate, statement/branch coverage (up to +9.01%), and mutation score (up to +11.23%). Finally, PolyTest outperforms Pynguin in test generation, passing rate, and mutation score.
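The abstract does not include code, but the unification step it describes is easy to sketch: collect the test cases produced across languages or samples, normalize them to (input, expected output) pairs, and resolve contradictions (same input, different expected outputs) when merging. The sketch below is an illustration of that idea only; the majority-vote resolution rule and the pair-based normalization are assumptions, not necessarily PolyTest's exact policy.

```python
from collections import Counter, defaultdict

def unify_test_sets(test_sets):
    """Unify test cases gathered from several generations (languages or samples).

    Each test set is a list of (input, expected_output) pairs already
    normalized to a language-neutral form. Contradicting tests (same input,
    different expected outputs) are resolved here by majority vote across
    sets -- an illustrative assumption, not PolyTest's documented policy.
    """
    votes = defaultdict(Counter)
    for tests in test_sets:
        for test_input, expected in tests:
            votes[test_input][expected] += 1

    unified = []
    for test_input, outcomes in votes.items():
        expected, _ = outcomes.most_common(1)[0]
        unified.append((test_input, expected))
    return unified

# Example: three generations disagree on add(2, 2); the majority answer is kept.
gen_python = [(("add", 2, 2), 4), (("add", -1, 1), 0)]
gen_java   = [(("add", 2, 2), 4)]
gen_c      = [(("add", 2, 2), 5)]  # contradicting expected output
print(unify_test_sets([gen_python, gen_java, gen_c]))
```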
Related papers
- TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code.
TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z) - Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging [10.334617290353192]
We evaluate whether Scientific Debugging can help Large Language Models (LLMs) generate tests for mutants. LLMs consistently outperform Pynguin in generating tests with better fault detection and coverage. Importantly, we observe that the iterative refinement of test cases is important for achieving high-quality test suites.
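The summary does not give the loop itself; a minimal sketch of such an iterative generate-and-refine cycle is shown below. The callable `llm`, the prompt wording, and the stopping rule are assumptions for illustration, not the paper's exact setup.

```python
def iterative_mutant_test(llm, original_fn, mutant_fn, max_rounds=3):
    """Iteratively ask an LLM for a test that kills a mutant, refining on feedback.

    `llm` is assumed to map a prompt to Python source defining `test(fn)`,
    which raises AssertionError on failure. A candidate is kept only if it
    passes on the original function and fails on the mutant (kills it).
    """
    feedback = ""
    for _ in range(max_rounds):
        source = llm(
            "Write a Python function test(fn) asserting the expected "
            "behaviour of the function under test." + feedback
        )
        namespace = {}
        exec(source, namespace)  # would be sandboxed in a real pipeline
        test = namespace["test"]

        try:
            test(original_fn)    # must pass on the original code
        except AssertionError:
            feedback = " The previous test failed on the original code; fix it."
            continue

        try:
            test(mutant_fn)      # must fail on the mutant
        except AssertionError:
            return source        # mutant killed
        feedback = " The previous test passed on the mutant; strengthen it."
    return None
```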
arXiv Detail & Related papers (2025-03-11T08:47:13Z) - ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z) - Improving the Readability of Automatically Generated Tests using Large Language Models [7.7149881834358345]
We propose to combine the effectiveness of search-based generators with the readability of LLM generated tests.
Our approach focuses on improving test and variable names produced by search-based tools, while keeping their semantics unchanged.
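As a rough illustration of such a renaming pass (the prompt format and the regex-based substitution are assumptions of this sketch; a real tool would rename via the AST):

```python
import re

def improve_test_names(llm, test_source):
    """Rename opaque identifiers in a generated test without changing behaviour.

    `llm` is assumed to return one `old_name -> new_name` mapping per line.
    Word-boundary substitution leaves everything else untouched; this is an
    illustrative sketch, not the paper's implementation.
    """
    reply = llm(
        "Suggest readable names for the test function and its variables, "
        "one `old -> new` mapping per line. Do not change any logic.\n\n"
        + test_source
    )
    renamed = test_source
    for line in reply.splitlines():
        if "->" in line:
            old, new = (part.strip() for part in line.split("->", 1))
            renamed = re.sub(rf"\b{re.escape(old)}\b", new, renamed)
    return renamed
```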
arXiv Detail & Related papers (2024-12-25T09:08:53Z) - Do LLMs generate test oracles that capture the actual or the expected program behaviour? [7.772338538073763]
Large Language Models (LLMs) are trained on an enormous amount of data to generate developer-like code and test cases.
This study includes developer-written and automatically generated test cases and oracles for 24 open-source Java repositories.
LLMs are better at generating test oracles than at classifying correct ones, and can generate better test oracles when the code contains meaningful test or variable names.
arXiv Detail & Related papers (2024-10-28T15:37:06Z) - Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z) - TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.833381226332574]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
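The summary only states that input and output generation are decoupled; a minimal sketch of that decoupling is given below. Deriving expected outputs by running a reference implementation is an assumption of the sketch (TestChain's agents may obtain outputs differently), as are the prompt and the `signature` parameter.

```python
import ast

def decoupled_test_generation(llm, reference_fn, signature):
    """Two-step test generation: propose inputs first, derive outputs second.

    Step 1 asks the LLM only for interesting inputs; step 2 computes the
    expected outputs by executing a reference implementation. Both the prompt
    and the use of a reference function are assumptions for illustration.
    """
    reply = llm(
        f"List 5 diverse argument tuples for `{signature}` as a Python list "
        "of tuples, and nothing else."
    )
    inputs = ast.literal_eval(reply)
    return [(args, reference_fn(*args)) for args in inputs]
```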
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z) - Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency [45.6224547703717]
This study focuses on tests of silent sentence reading efficiency, used to assess students' reading ability over time.
We propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items.
We show the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses.
arXiv Detail & Related papers (2023-10-10T17:59:51Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
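For concreteness, one standard way to "dynamically adjust items" is computerized adaptive testing over an item response model, e.g. picking the unasked item with maximum Fisher information under a 2PL model. The sketch below uses that textbook recipe purely as an illustration; the paper discusses the paradigm rather than prescribing this algorithm.

```python
import math

def two_pl_prob(theta, a, b):
    """2PL model: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def next_item(theta, items, asked):
    """Select the unasked item with maximum Fisher information at ability theta.

    `items` maps item_id -> (a, b). The 2PL model and maximum-information rule
    are standard psychometric choices used here only for illustration.
    """
    def information(item_id):
        a, b = items[item_id]
        p = two_pl_prob(theta, a, b)
        return a * a * p * (1.0 - p)

    candidates = [i for i in items if i not in asked]
    return max(candidates, key=information)
```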
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
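A minimal sketch of that verification-based reranking is shown below; the multiplicative combination of generator probability and verifier probability, and the `execute` helper, are simplifications assumed for illustration rather than the paper's exact formulation.

```python
import math

def rerank_with_verifier(candidates, verifier, nl_query, execute):
    """Rerank sampled programs with a learned verifier.

    `candidates` is a list of (program, generation_logprob) pairs; `verifier`
    is assumed to return P(correct | nl_query, program, execution_result).
    The best-scoring program under the combined score is returned.
    """
    def score(candidate):
        program, logprob = candidate
        result = execute(program)  # run the candidate to obtain its output
        return math.exp(logprob) * verifier(nl_query, program, result)

    return max(candidates, key=score)
```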
arXiv Detail & Related papers (2023-02-16T18:23:22Z) - BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome since the test sentences are generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.