The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models
- URL: http://arxiv.org/abs/2505.02931v1
- Date: Mon, 05 May 2025 18:06:51 GMT
- Authors: Fernando Vallecillos Ruiz, Max Hort, Leon Moonen
- Abstract summary: We investigate an APR pipeline that balances the generation of multiple outputs and multiple rounds of iteration. We fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA). Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated.
- Score: 48.073219761367184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic program repair (APR) aims to reduce the manual effort required to identify and fix errors in source code. Before the rise of LLM-based agents, a common strategy was to increase the number of generated patches, sometimes into the thousands, to achieve better repair results on benchmarks. More recently, self-iterative capabilities have enabled LLMs to refine patches over multiple rounds guided by feedback. However, the literature often focuses on many iterations and disregards different numbers of outputs. We investigate an APR pipeline that balances these two approaches, the generation of multiple outputs and multiple rounds of iteration, while imposing a limit of 10 total patches per bug. We apply three state-of-the-art instruction-tuned LLMs (DeepSeekCoder-Instruct, CodeLlama-Instruct, and Llama3.1-Instruct) to the APR task. We further fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J. Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated, challenging prior studies that reported limited gains from Full Fine-Tuning. However, we find that exceeding certain thresholds leads to diminishing returns, likely due to overfitting. Moreover, we show that base models benefit greatly from creating patches iteratively rather than generating them all at once. In addition, the benefit of iterative strategies becomes more pronounced on complex benchmarks. Even fine-tuned models, while benefiting less from iteration, still gain advantages, particularly on complex benchmarks. The research underscores the need for balanced APR strategies that combine multi-output generation and iterative refinement.
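To make the pipeline concrete, below is a minimal sketch of a budget-constrained iterative repair loop of the kind the abstract describes. The `generate_patches` and `run_tests` helpers are hypothetical stand-ins for an LLM call and a test-suite oracle, and the 2-outputs-per-round by 5-rounds split is just one way to spend the 10-patch budget; the paper's point is precisely that this split is a tunable trade-off.

```python
# Sketch only: assumes hypothetical `generate_patches` (LLM wrapper) and
# `run_tests` (test oracle) callables; not the authors' actual implementation.
from typing import Callable, List, Optional

def iterative_repair(
    buggy_code: str,
    generate_patches: Callable[[str, str, int], List[str]],  # (code, feedback, n) -> n candidate patches
    run_tests: Callable[[str], Optional[str]],               # None if all tests pass, else an error log
    outputs_per_round: int = 2,
    max_rounds: int = 5,  # outputs_per_round * max_rounds == total budget of 10 patches per bug
) -> Optional[str]:
    feedback = ""  # first round has no test feedback yet
    for _ in range(max_rounds):
        for patch in generate_patches(buggy_code, feedback, outputs_per_round):
            error = run_tests(patch)
            if error is None:
                return patch   # plausible patch: the whole test suite passes
            feedback = error   # feed the latest failure into the next round
    return None                # budget exhausted without a plausible patch
```

Setting `outputs_per_round=10, max_rounds=1` recovers pure multi-output generation, while `outputs_per_round=1, max_rounds=10` recovers pure iterative refinement; the study explores the space between these extremes.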
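Similarly, here is a rough sketch of what LoRA fine-tuning of one of the studied instruction-tuned models could look like with the Hugging Face `peft` library; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch only: illustrative LoRA setup, not the authors' exact hyperparameters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct"  # one of the studied model families
)
lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # scaling factor for the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```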
Related papers
- The Impact of Fine-tuning Large Language Models on Automated Program Repair [5.868532677577195]
Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch.
arXiv Detail & Related papers (2025-07-26T10:42:08Z)
- Accelerating Automatic Program Repair with Dual Retrieval-Augmented Fine-Tuning and Patch Generation on Large Language Models [28.75106676284909]
We propose SelRepair, a novel APR approach that integrates a fine-tuned LLM with a newly designed dual RAG module. This approach uses a bug-fix pair dataset for fine-tuning and incorporates semantic and syntactic/structural similarity information through a RAG selection gate. Evaluations on Java datasets show SelRepair outperforms other APR methods, achieving 26.29% and 17.64% exact match (EM) on different datasets while reducing inference time by at least 6.42% with controlled input lengths.
arXiv Detail & Related papers (2025-07-14T09:41:51Z)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- Unlocking Recursive Thinking of LLMs: Alignment via Refinement [27.702786437714888]
We propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of Large Language Models for recursive thinking. With only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0.
arXiv Detail & Related papers (2025-06-06T11:54:06Z)
- Empirical Evaluation of Generalizable Automated Program Repair with Large Language Models [4.757323827658957]
Automated Program Repair proposes bug fixes to aid developers in maintaining software. Recent works have shown that LLMs can be used to generate repairs. We evaluate a diverse set of 13 recent models, including open ones (e.g., Llama 3.3, Qwen 2.5 Coder, and DeepSeek R1 (dist.)) and closed ones (e.g., o3-mini, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash).
arXiv Detail & Related papers (2025-06-03T18:15:14Z)
- Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation [22.100878392437508]
We conduct the first empirical study to investigate the prevalence and nature of repetition across 19 state-of-the-art code LLMs. We propose DeRep, a rule-based technique designed to detect and mitigate repetition in generated code.
arXiv Detail & Related papers (2025-04-17T03:13:39Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Semantic-guided Search for Efficient Program Repair with Large Language Models [0.9319432628663639]
FLAMES employs semantic-guided patch generation to enhance repair effectiveness and memory efficiency.
FLAMES substantially reduces memory consumption by up to 83% compared to conventional LLM-based APR.
Remarkably, FLAMES generated correct fixes for 133 of 333 bugs in Defects4J and for 103 of 163 bugs in HumanEval-Java.
arXiv Detail & Related papers (2024-10-22T02:59:47Z)
- MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning [60.55556283848063]
Large Language Model (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among them.
Refinement offers an alternative by using LLM-generated feedback to improve solution quality.
We propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard.
arXiv Detail & Related papers (2024-09-18T17:12:41Z)
- RePair: Automated Program Repair with Process-based Feedback [28.017321930042694]
We show how small-scale language models (LMs) can achieve excellent performance through process supervision and feedback.
We develop a reward model that serves as a critic, providing feedback for the fine-tuned LM's action.
The results show that the process-based approach not only outperforms larger outcome-based generation methods, but also nearly matches the performance of closed-source commercial large-scale LMs.
arXiv Detail & Related papers (2024-08-21T02:53:23Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with performance stronger than or comparable to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Enhancing Redundancy-based Automated Program Repair by Fine-grained Pattern Mining [18.3896381051331]
We propose a new repair technique named Repatt, which incorporates a two-level pattern mining process for guiding effective patch generation.
We conducted an experiment on the widely used Defects4J benchmark and compared Repatt with eight state-of-the-art APR approaches.
arXiv Detail & Related papers (2023-12-26T08:42:32Z)
- FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97× compared with state-of-the-art lipreading models.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)