Automated Generation of Issue-Reproducing Tests by Combining LLMs and Search-Based Testing
- URL: http://arxiv.org/abs/2509.01616v1
- Date: Mon, 01 Sep 2025 16:54:24 GMT
- Title: Automated Generation of Issue-Reproducing Tests by Combining LLMs and Search-Based Testing
- Authors: Konstantinos Kitsios, Marco Castelluccio, Alberto Bacchelli,
- Abstract summary: Issue-reproducing tests fail on buggy code and pass once a patch is applied. Past research has shown that developers often commit patches without such tests. We propose a tool for automatically generating issue-reproducing tests from issue-patch pairs.
- Score: 5.008597638379228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Issue-reproducing tests fail on buggy code and pass once a patch is applied, thus increasing developers' confidence that the issue has been resolved and will not be re-introduced. However, past research has shown that developers often commit patches without such tests, making the automated generation of issue-reproducing tests an area of interest. We propose BLAST, a tool for automatically generating issue-reproducing tests from issue-patch pairs by combining LLMs and search-based software testing (SBST). For the LLM part, we complement the issue description and the patch by extracting relevant context through git history analysis, static analysis, and SBST-generated tests. For the SBST part, we adapt SBST for generating issue-reproducing tests; the issue description and the patch are fed into the SBST optimization through an intermediate LLM-generated seed, which we deserialize into SBST-compatible form. BLAST successfully generates issue-reproducing tests for 151/426 (35.4%) of the issues from a curated Python benchmark, outperforming the state-of-the-art (23.5%). Additionally, to measure the real-world impact of BLAST, we built a GitHub bot that runs BLAST whenever a new pull request (PR) linked to an issue is opened, and if BLAST generates an issue-reproducing test, the bot proposes it as a comment in the PR. We deployed the bot in three open-source repositories for three months, gathering data from 32 PR-issue pairs. BLAST generated an issue-reproducing test in 11 of these cases, which we proposed to the developers. By analyzing the developers' feedback, we discuss challenges and opportunities for researchers and tool builders. Data and material: https://doi.org/10.5281/zenodo.16949042
Related papers
- Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation [20.31612139450269]
Testing pull requests (PRs) is critical to maintaining software quality. Some PR-modified lines remain untested, leaving a "last-mile" regression test gap. We present ChaCo, an LLM-based test augmentation technique that addresses this gap.
arXiv Detail & Related papers (2026-01-16T02:08:16Z) - AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [0.7564784873669823]
We introduce AssertFlip, a technique for automatically generating Bug Reproducible Tests (BRTs) using large language models (LLMs). AssertFlip first generates passing tests on the buggy behaviour and then inverts these tests to fail when the bug is present. Our results show that AssertFlip outperforms all known techniques on the leaderboard of SWT-Bench, a benchmark curated for BRTs.
arXiv Detail & Related papers (2025-07-23T14:19:55Z) - Detecting Semantic Conflicts with LLM-Generated Tests [1.201626478128059]
We propose and integrate a new test generation tool based on Code Llama 70B into SMAT. SMAT relies on generating and executing unit tests: if a test fails on the base version, passes on a developer's modified version, but fails again after merging with another developer's changes, a semantic conflict is indicated. Results indicate that, although LLM-based test generation remains challenging and computationally expensive in complex scenarios, there is promising potential for improving semantic conflict detection.
arXiv Detail & Related papers (2025-07-09T11:38:53Z) - UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench [8.00058513405915]
We introduce UTGenerator, an LLM-driven test case generator for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE-Bench.
arXiv Detail & Related papers (2025-06-10T22:56:49Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - Issue2Test: Generating Reproducing Test Cases from Issue Reports [21.28421180698285]
A crucial step toward successfully solving an issue is creating a test case that accurately reproduces it. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 30.4% of the issues.
arXiv Detail & Related papers (2025-03-20T16:44:00Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs). We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL [46.65963514391019]
AutoRestTest is a novel tool that integrates the Semantic Property Dependency Graph (SPDG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing.
arXiv Detail & Related papers (2025-01-15T05:54:33Z) - LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z) - TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel method that improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
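The test-inversion idea summarized in the AssertFlip entry above can be sketched in a few lines. The `flip_asserts` helper below is a hypothetical illustration using Python's `ast` module, not the paper's implementation: it negates every `assert` condition, so a test that passes on the buggy behaviour becomes one that fails while the bug is present.

```python
import ast

def flip_asserts(source: str) -> str:
    """Wrap every `assert` condition in `not (...)`, turning a test
    that passes on the buggy behaviour into one that fails on it."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assert):
            node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
            node.msg = None  # the old failure message no longer applies
    return ast.unparse(ast.fix_missing_locations(tree))

# A passing test that (wrongly) encodes the buggy behaviour of add():
buggy_test = "assert add(2, 2) == 5"
print(flip_asserts(buggy_test))
```

In practice, inverting a compound assertion this way only guarantees the flipped test fails on the buggy revision; whether it passes after the patch still has to be checked by execution, as in the fail-then-pass criterion above.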
This list is automatically generated from the titles and abstracts of the papers in this site.