When Old Meets New: Evaluating the Impact of Regression Tests on SWE Issue Resolution
- URL: http://arxiv.org/abs/2510.18270v1
- Date: Tue, 21 Oct 2025 03:42:28 GMT
- Title: When Old Meets New: Evaluating the Impact of Regression Tests on SWE Issue Resolution
- Authors: Yang Chen, Toufique Ahmed, Reyhaneh Jabbarvand, Martin Hirzel
- Abstract summary: TestPrune is a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance.
- Score: 8.305144449617883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression tests are typically designed to ensure past functionality is preserved in the new version, they can also serve a complementary purpose: debugging the current version. Specifically, regression tests can (1) enhance the generation of reproduction tests for newly reported issues, and (2) validate that patches do not regress existing functionality. We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests. Due to the predominance of LLM-based debugging techniques, this minimization is essential as large test suites exceed context limits, introduce noise, and inflate inference costs. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance. As a proof of concept, we show that TestPrune leads to a 6.2%-9.0% relative increase in issue reproduction rate within the Otter framework and a 9.4%-12.9% relative increase in issue resolution rate within the Agentless framework on SWE-Bench Lite and SWE-Bench Verified benchmarks, capturing fixes that were correctly produced by agents but not submitted as final patches. Compared to the benefits, the cost overhead of using TestPrune is minimal, i.e., $0.02 and $0.05 per SWE-Bench instance, using GPT-4o and Claude-3.7-Sonnet models, respectively.
Related papers
- MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning [19.054149750597933]
MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning) is a framework that shifts the focus to "scaling-by-utility". We introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while suppressing functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T03:22:44Z)
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation [20.31612139450269]
Testing pull requests (PRs) is critical to maintaining software quality. Some PR-modified lines remain untested, leaving a "last-mile" regression test gap. We present ChaCo, an LLM-based test augmentation technique that addresses this gap.
arXiv Detail & Related papers (2026-01-16T02:08:16Z)
- Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis [57.40527331817245]
Test oracle generation in non-regression testing is a longstanding challenge in software engineering. We introduce Nexus, a novel multi-agent framework to address this challenge.
arXiv Detail & Related papers (2025-10-30T12:20:25Z)
- Unit Test Update through LLM-Driven Context Collection and Error-Type-Aware Refinement [5.8748750353007635]
Test maintenance methods primarily focus on repairing broken tests, neglecting the scenario of enhancing existing tests to verify new functionality. We propose TestUpdater, a novel approach that enables automated just-in-time test updates in response to production code changes. TestUpdater achieves a compilation pass rate of 94.4% and a test pass rate of 86.7%, outperforming the state-of-the-art method SYNTER by 15.9% and 20.0%, respectively.
arXiv Detail & Related papers (2025-09-29T08:08:22Z)
- Repair-R1: Better Test Before Repair [2.982543556561469]
Automated program repair (APR) aims to automatically locate program defects, generate patches, and validate the repairs. Current APR methods typically utilize test cases only during the inference stage. We propose Repair-R1, which introduces test cases into the model's training phase and shifts test generation to precede repair.
arXiv Detail & Related papers (2025-07-30T17:24:05Z)
- Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets [0.0]
We introduce GPR-bench, a benchmark that operationalizes regression testing for general-purpose use cases. We show that newer models generally improve correctness, but the differences are modest and not statistically significant. In contrast, the concise-writing instruction significantly enhances conciseness, demonstrating the effectiveness of prompt engineering.
arXiv Detail & Related papers (2025-05-02T12:31:43Z)
- TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper focuses on performing both sample recognition and outlier rejection during inference when outliers are present.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z)
- Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
- Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
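Dorfman's classic efficiency claim in the entry above is easy to check with a line of arithmetic: pooling g samples costs 1/g pooled tests per person plus g follow-up tests whenever the pool is positive, giving an expected 1/g + 1 - (1-p)^g tests per person at prevalence p. A minimal illustrative sketch (the function name is ours, not from the paper):

```python
# Expected tests per person under Dorfman's two-stage group testing:
# one pooled test shared by g people, plus individual retests for
# every member of a positive pool (probability 1 - (1-p)**g).
def dorfman_tests_per_person(p: float, g: int) -> float:
    return 1 / g + 1 - (1 - p) ** g
```

At 2% prevalence with pools of 8, this comes to roughly 0.27 tests per person, well below the 1.0 of individual testing, which is the efficiency gain the noisy, Bayesian variants in the paper build on.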
This list is automatically generated from the titles and abstracts of the papers on this site.