Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation
- URL: http://arxiv.org/abs/2601.10942v1
- Date: Fri, 16 Jan 2026 02:08:16 GMT
- Title: Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation
- Authors: Zitong Zhou, Matteo Paltenghi, Miryung Kim, Michael Pradel
- Abstract summary: Testing pull requests (PRs) is critical to maintaining software quality. Some PR-modified lines remain untested, leaving a "last-mile" regression test gap. We present ChaCo, an LLM-based test augmentation technique that addresses this gap.
- Score: 20.31612139450269
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software is in constant evolution, with developers frequently submitting pull requests (PRs) to introduce new features or fix bugs. Testing PRs is critical to maintaining software quality. Yet, even in projects with extensive test suites, some PR-modified lines remain untested, leaving a "last-mile" regression test gap. Existing test generators typically aim to improve overall coverage, but do not specifically target the uncovered lines in PRs. We present Change And Cover (ChaCo), an LLM-based test augmentation technique that addresses this gap. It makes three contributions: (i) ChaCo considers the PR-specific patch coverage, offering developers augmented tests for code just when it is on the developers' mind. (ii) We identify providing suitable test context as a crucial challenge for an LLM to generate useful tests, and present two techniques to extract relevant test content, such as existing test functions, fixtures, and data generators. (iii) To make augmented tests acceptable for developers, ChaCo carefully integrates them into the existing test suite, e.g., by matching the test's structure and style with the existing tests, and generates a summary of the test addition for developer review. We evaluate ChaCo on 145 PRs from three popular and complex open-source projects - SciPy, Qiskit, and Pandas. The approach successfully helps 30% of PRs achieve full patch coverage, at the cost of $0.11, showing its effectiveness and practicality. Human reviewers find the tests to be worth adding (4.53/5.0), well integrated (4.2/5.0), and relevant to the PR (4.7/5.0). Ablations show test context is crucial for context-aware test generation, leading to 2x coverage. We submitted 12 tests, of which 8 have already been merged, and two previously unknown bugs were exposed and fixed. We envision our approach to be integrated into CI workflows, automating the last mile of regression test augmentation.
Related papers
- CodeContests-O: Powering LLMs via Feedback-Driven Iterative Test Case Generation [71.42965967582147]
Existing approaches attempt to synthesize test cases using Large Language Models (LLMs). We propose a Feedback-Driven Iterative Framework for comprehensive test case construction. Our dataset achieves an average True Positive Rate (TPR) of 89.37% and True Negative Rate (TNR) of 90.89%, significantly outperforming CodeContests and CodeContests+ by margins of 4.32% and 9.37%, respectively.
arXiv Detail & Related papers (2026-01-20T07:32:44Z) - When Old Meets New: Evaluating the Impact of Regression Tests on SWE Issue Resolution [8.305144449617883]
TestPrune is a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. TestPrune can be plugged into any agentic bug repair pipeline and improve overall performance.
arXiv Detail & Related papers (2025-10-21T03:42:28Z) - Alignment with Fill-In-the-Middle for Enhancing Code Generation [56.791415642365415]
We propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval(+), MBPP(+), APPS, LiveCodeBench, and BigCodeBench.
arXiv Detail & Related papers (2025-08-27T03:15:53Z) - Intention-Driven Generation of Project-Specific Test Cases [45.2380093475221]
We propose IntentionTest, which generates project-specific tests given the description of validation intention. We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects.
arXiv Detail & Related papers (2025-07-28T08:35:04Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This study is the first empirical study on early test termination due to assertion failure. We investigated 207 versions of 6 open-source projects. Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z) - Issue2Test: Generating Reproducing Test Cases from Issue Reports [17.854783249394913]
A crucial step toward successfully solving an issue is creating a test case that accurately reproduces the issue. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 32.9% of the issues.
arXiv Detail & Related papers (2025-03-20T16:44:00Z) - TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code. TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z) - TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvements. We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z) - Retrieval-Augmented Test Generation: How Far Are We? [10.473792371852015]
We investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs. We examine three domain-specific sources for RAG: API documentation (official guidelines), GitHub issues (developer-reported resolutions), and StackOverflow Q&As. Our study focuses on five widely used Python-based ML/DL libraries, including PyTorch, Scikit-learn, Google JAX, and XGBoost.
arXiv Detail & Related papers (2024-09-19T11:48:29Z) - Constrained C-Test Generation via Mixed-Integer Programming [55.28927994487036]
This work proposes a novel method to generate C-Tests; a form of cloze tests (a gap filling exercise) where only the last part of a word is turned into a gap.
In contrast to previous works that only consider varying the gap size or gap placement to achieve locally optimal solutions, we propose a mixed-integer programming (MIP) approach.
We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open source license.
arXiv Detail & Related papers (2024-04-12T21:35:21Z) - CoverUp: Effective High Coverage Test Generation for Python [0.7673339435080445]
CoverUp is a novel approach to driving the generation of high-coverage Python regression tests. CoverUp combines coverage analysis, code context, and feedback in prompts that iteratively guide the LLM to generate tests. Compared to CodaMosa, a hybrid search/LLM-based test generator, CoverUp achieves a per-module median line+branch coverage of 80%.
arXiv Detail & Related papers (2024-03-24T16:18:27Z) - Automated Unit Test Improvement using Large Language Models at Meta [44.87533111512982]
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests.
We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms.
arXiv Detail & Related papers (2024-02-14T13:43:14Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.