Repair-R1: Better Test Before Repair
- URL: http://arxiv.org/abs/2507.22853v1
- Date: Wed, 30 Jul 2025 17:24:05 GMT
- Title: Repair-R1: Better Test Before Repair
- Authors: Haichuan Hu, Xiaochen Xie, Quanjun Zhang
- Abstract summary: APR aims to automatically locate program defects, generate patches and validate the repairs. Current APR methods typically utilize test cases only during the inference stage. We propose Repair-R1, which introduces test cases into the model's training phase and shifts test generation to precede repair.
- Score: 2.982543556561469
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: APR (Automated Program Repair) aims to automatically locate program defects, generate patches and validate the repairs. Existing APR techniques are often combined with LLMs (Large Language Models), leveraging the code-related knowledge of LLMs to improve repair effectiveness. Current LLM-based APR methods typically utilize test cases only during the inference stage, adopting an iterative approach that performs repair first and validates it through test execution afterward. This conventional paradigm neglects two important aspects: the potential contribution of test cases in the training phase, and the possibility of leveraging testing prior to repair. To address this, we propose Repair-R1, which introduces test cases into the model's training phase and shifts test generation to precede repair. The model is required to first generate discriminative test cases that can distinguish defective behaviors, and then perform repair based on these tests. This enables the model to better locate defects and understand the underlying causes of defects, thereby improving repair effectiveness. We implement Repair-R1 with three different backbone models, using RL (reinforcement learning) to co-optimize test generation and bug repair. Experimental results on four widely adopted benchmarks demonstrate the superiority of Repair-R1. Specifically, compared to vanilla models, Repair-R1 improves repair success rate by 2.68% to 48.29%, test generation success rate by 16.38% to 53.28%, and test coverage by 0.78% to 53.96%. We publish the code and weights at https://github.com/Tomsawyerhu/APR-RL and https://huggingface.co/tomhu/Qwen3-4B-RL-5000-step.
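As a rough illustration of the test-before-repair reward described in the abstract, the sketch below scores a generated test as discriminative only if it fails on the buggy program and passes on a known-correct reference, and scores a patch by whether it passes that test. The Python harness, function names, and exact reward shaping are assumptions for illustration, not the paper's implementation.

```python
import os
import subprocess
import tempfile

def run_test(program_src: str, test_src: str, timeout: float = 5.0) -> bool:
    """Execute `test_src` appended to `program_src`; True iff it exits cleanly."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "snippet.py")
        with open(path, "w") as f:
            f.write(program_src + "\n" + test_src)
        try:
            proc = subprocess.run(["python", path], capture_output=True,
                                  timeout=timeout)
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            return False

def reward(buggy_src, reference_src, generated_test, generated_patch):
    """Reward discriminative tests first, then patches that pass them."""
    # A test is "discriminative" when it fails on the buggy program but
    # passes on a known-correct reference implementation (assumed shaping).
    discriminative = (not run_test(buggy_src, generated_test)
                      and run_test(reference_src, generated_test))
    test_reward = 1.0 if discriminative else 0.0
    repair_reward = 1.0 if run_test(generated_patch, generated_test) else 0.0
    return test_reward + repair_reward
```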
Related papers
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty [59.97939500426759]
This paper describes RLCR, an approach to training reasoning models that jointly improves accuracy and confidence estimation. We show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy. We also demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration.
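A minimal sketch of a correctness-plus-calibration reward in the spirit of RLCR; the Brier-style penalty on verbalized confidence is an assumed shaping, not necessarily the paper's exact formula.

```python
def rlcr_reward(answer_correct: bool, verbalized_confidence: float) -> float:
    """Binary correctness minus a Brier penalty on the stated confidence."""
    y = 1.0 if answer_correct else 0.0
    brier = (verbalized_confidence - y) ** 2
    return y - brier

# A wrong answer stated with high confidence is penalized harder
# than a wrong answer stated with low confidence.
print(rlcr_reward(False, 0.9))  # -0.81
print(rlcr_reward(False, 0.1))  # -0.01
```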
arXiv Detail & Related papers (2025-07-22T17:56:01Z)
- Input Reduction Enhanced LLM-based Program Repair [2.098274800451098]
Test inputs are crucial for reasoning about the root cause of failures. When test inputs are extensive in the prompt, they may trigger the "lost-in-the-middle" issue, compromising repair performance. We propose ReduceFix, an APR approach with a built-in component that automatically reduces test inputs while retaining their failure-inducing behavior.
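The paper's reduction component is not described here; the sketch below shows the general idea with a classic delta-debugging-style loop that shrinks a failing input while a caller-supplied predicate confirms it still triggers the failure.

```python
def reduce_input(lines, still_fails):
    """Greedily drop chunks of `lines` while `still_fails(lines)` holds."""
    assert still_fails(lines)
    chunk = max(1, len(lines) // 2)
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_fails(candidate):
                lines = candidate          # keep the smaller failing input
            else:
                i += chunk                 # this chunk was needed; move on
        chunk //= 2
    return lines

# Usage: in practice, still_fails would rerun the failing test on the input.
smaller = reduce_input(list(range(100)), lambda xs: 42 in xs)
print(smaller)  # [42]
```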
arXiv Detail & Related papers (2025-07-21T05:26:32Z)
- Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [84.30534714651093]
We present an innovative APR tool for Dafny, a verification-aware programming language. We localize faults through a series of steps, which include using Hoare Logic to determine the state of each statement within the program. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
- Repair Ingredients Are All You Need: Improving Large Language Model-Based Program Repair via Repair Ingredients Search [41.50068103527948]
We propose ReinFix, a framework that searches for repair ingredients throughout the reasoning and solution phases of bug fixing. During the solution phase, ReinFix searches for external ingredients from historical bug fixes with similar bug patterns. Evaluations on two popular benchmarks demonstrate the effectiveness of our approach over SOTA baselines.
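A toy sketch of ingredient search: rank historical bug-fix pairs by token overlap with the current buggy snippet and return the top candidates as repair context. The Jaccard similarity and database layout are illustrative assumptions, not ReinFix's retrieval method.

```python
import re

def tokens(code: str) -> set:
    """Crude lexical tokenization: identifiers and keywords only."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def retrieve_ingredients(buggy: str, history: list, k: int = 3) -> list:
    """history: list of (old_buggy_code, fix_code) pairs."""
    def jaccard(a, b):
        return len(a & b) / max(1, len(a | b))
    q = tokens(buggy)
    ranked = sorted(history, key=lambda pair: jaccard(q, tokens(pair[0])),
                    reverse=True)
    return ranked[:k]  # top-k candidate fixes to show the repair model
```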
arXiv Detail & Related papers (2025-06-29T06:02:11Z)
- The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models [48.073219761367184]
We investigate an APR pipeline that balances the generation of multiple outputs and multiple rounds of iteration. We fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA). Our results show that by using only a fraction (1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated.
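The paper's training setup is not reproduced here; the snippet below is a minimal LoRA fine-tuning setup using the Hugging Face peft library, with the base checkpoint and hyperparameters as placeholder assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; the paper's checkpoints may differ.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction is trainable
```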
arXiv Detail & Related papers (2025-05-05T18:06:51Z)
- Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
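SCoRe itself is a training method; the sketch below only illustrates the two-turn self-correction format such a model is run with at inference. The generate placeholder stands in for a real LLM call and is an assumption, not the paper's interface.

```python
def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM call")

def self_correct(problem: str) -> str:
    """Two-turn format: answer, then revise the answer in a second turn."""
    first = generate(problem)
    revised = generate(
        f"{problem}\n\nYour previous answer:\n{first}\n\n"
        "Review the answer above for mistakes and produce a corrected "
        "final answer."
    )
    return revised  # training rewards improvement from first to revised
```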
arXiv Detail & Related papers (2024-09-19T17:16:21Z)
- TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
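A hedged sketch of the generation-repair co-evolution loop named in the title: draft a test, execute it, and feed concrete failures back into a repair step. The llm_* and run placeholders are assumptions, not TestART's actual interfaces.

```python
def llm_generate_test(focal_src):
    raise NotImplementedError("LLM call: draft a unit test for focal_src")

def llm_repair_test(focal_src, broken_test, error_msg):
    raise NotImplementedError("LLM call: fix broken_test given error_msg")

def run(focal_src, test_src):
    raise NotImplementedError("compile and execute; return (ok, error_msg)")

def co_evolve_test(focal_src, max_rounds=4):
    """Alternate test generation and test repair until the test runs."""
    test = llm_generate_test(focal_src)
    for _ in range(max_rounds):
        ok, error_msg = run(focal_src, test)
        if ok:
            return test              # a compiling, passing unit test
        # feed the concrete failure back so the next round can repair it
        test = llm_repair_test(focal_src, test, error_msg)
    return None
```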
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs [23.419180504723546]
ContrastRepair is a novel APR approach that augments conversation-driven APR by providing contrastive test pairs.
We evaluate ContrastRepair on multiple benchmark datasets, including Defects4j, QuixBugs, and HumanEval-Java.
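A sketch of the core prompting idea: pair a failing test with a closely related passing test so the model can contrast expected and defective behavior. The template wording is an assumption, not ContrastRepair's actual prompt.

```python
def build_contrastive_prompt(buggy_code: str, failing_test: str,
                             passing_test: str) -> str:
    """Assemble a repair prompt around a contrastive pass/fail test pair."""
    return (
        "The following function is buggy:\n"
        f"{buggy_code}\n\n"
        f"This test PASSES (expected behavior):\n{passing_test}\n\n"
        f"This test FAILS (defective behavior):\n{failing_test}\n\n"
        "Explain what difference the two tests expose, then return a fixed "
        "version of the function."
    )
```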
arXiv Detail & Related papers (2024-03-04T12:15:28Z)
- Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [12.503002900186997]
Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices. We propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM).
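A toy reward in the spirit of static quality metrics: score a generated test with cheap static checks, no execution required. The specific checks and weights are illustrative assumptions, not RLSQM's metric set.

```python
import ast

def static_quality_reward(test_src: str) -> float:
    """Score a generated test purely from static properties of its AST."""
    try:
        tree = ast.parse(test_src)
    except SyntaxError:
        return -1.0                       # unparseable tests are penalized
    has_assert = any(isinstance(n, ast.Assert) for n in ast.walk(tree))
    has_docstring = ast.get_docstring(tree) is not None
    score = 0.0
    score += 0.5 if has_assert else -0.5  # tests should assert something
    score += 0.25 if has_docstring else 0.0
    return score

print(static_quality_reward("def test_add():\n    assert 1 + 1 == 2\n"))  # 0.5
```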
arXiv Detail & Related papers (2023-10-03T18:48:31Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen).
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
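A minimal sketch of retrieval-augmented patch generation with a CodeT5-style seq2seq model: prepend the most similar retrieved bug-fix pair to the buggy code before generating. The input format is an assumption, and the stock Salesforce/codet5-base checkpoint is a stand-in for RAP-Gen's fine-tuned model.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

def generate_patch(buggy_code: str, retrieved_pair: tuple) -> str:
    """Condition patch generation on one retrieved (bug, fix) exemplar."""
    old_bug, old_fix = retrieved_pair       # most similar historical pair
    source = (f"retrieved buggy code: {old_bug} "
              f"retrieved fix: {old_fix} "
              f"buggy code: {buggy_code}")
    ids = tok(source, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)
```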
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair [0.5749787074942512]
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test.
In this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis.
One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category.
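A sketch of that two-step idea: predict a flakiness fix category first, then inject it into the repair prompt. The category list and classifier are placeholders; FlakyFix's actual label set is not reproduced here.

```python
# Hypothetical category labels, not FlakyFix's taxonomy.
CATEGORIES = ["async wait", "test order dependency", "time", "randomness"]

def predict_fix_category(test_src: str) -> str:
    raise NotImplementedError("replace with a trained classifier or LLM")

def build_repair_prompt(test_src: str) -> str:
    """Guide repair with the predicted fix category as extra knowledge."""
    category = predict_fix_category(test_src)
    return (
        f"This test is flaky. Predicted fix category: {category}.\n"
        f"Rewrite the test to remove that source of flakiness:\n{test_src}"
    )
```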
arXiv Detail & Related papers (2023-06-21T19:34:16Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.