Related papers: Studying and Understanding the Effectiveness and Failures of Conversational LLM-Based Repair

Studying and Understanding the Effectiveness and Failures of Conversational LLM-Based Repair

URL: http://arxiv.org/abs/2503.15050v2
Date: Wed, 09 Apr 2025 14:18:47 GMT
Title: Studying and Understanding the Effectiveness and Failures of Conversational LLM-Based Repair
Authors: Aolin Chen, Haojun Wu, Qi Xin, Steven P. Reiss, Jifeng Xuan,
Abstract summary: Automated program repair (APR) is designed to automate the process of bug-fixing.<n>Advanced APR techniques powered by conversational language models (LLMs) have exhibited impressive repair abilities.<n>Despite the superiority, conversational APR techniques still fail to repair a large number of bugs.
Score: 3.93048798243871
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated program repair (APR) is designed to automate the process of bug-fixing. In recent years, thanks to the rapid development of large language models (LLMs), automated repair has achieved remarkable progress. Advanced APR techniques powered by conversational LLMs, most notably ChatGPT, have exhibited impressive repair abilities and gained increasing popularity due to the capabilities of the underlying LLMs in providing repair feedback and performing iterative patch improvement. Despite the superiority, conversational APR techniques still fail to repair a large number of bugs. For example, a state-of-the-art conversational technique ChatRepair does not correctly repair over half of the single-function bugs in the Defects4J dataset. To understand the effectiveness and failures of conversational LLM-based repair and provide possible directions for improvement, we studied the exemplary ChatRepair with a focus on comparing the effectiveness of its cloze-style and full function repair strategies, assessing its key iterative component for patch improvement, and analyzing the repair failures. Our study has led to a series of findings, which we believe provide key implications for future research.

Related papers

Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs.<n>This paper investigates how programmers use Large Language Models to complement their own skills.<n>The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [84.30534714651093]
We present an innovative APR tool for Dafny, a verification-aware programming language.<n>We localize faults through a series of steps, which include using Hoare Logic to determine the state of each statement within the program.<n>We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
FastFixer: An Efficient and Effective Approach for Repairing Programming Assignments [21.848112758958543]
We propose FastFixer, an efficient and effective approach for programming assignment repair. We first propose a novel repair-oriented fine-tuning strategy, aiming to enhance the LLM's attention towards learning how to generate the necessary patch and its associated context. Considering the repair efficiency, FastFixer achieves a remarkable inference speedup of 16.67 times compared to the autoregressive decoding algorithm.
arXiv Detail & Related papers (2024-10-11T10:17:02Z)
Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models [48.42142115255159]
We release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task. We evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings.
arXiv Detail & Related papers (2024-09-21T21:06:25Z)
On The Effectiveness of Dynamic Reduction Techniques in Automated Program Repair [1.7767466724342067]
We describe a program repair framework that effectively handles large-scale buggy programs of industrial complexity. The framework exploits program reduction in the form of program slicing to eliminate parts of the code irrelevant to the bug being repaired. Our empirical results on the widely used Defects4J dataset reveal that a substantial improvement in performance can be obtained without any degradation in repair quality.
arXiv Detail & Related papers (2024-06-23T21:35:07Z)
Investigating the Transferability of Code Repair for Low-Resource Programming Languages [57.62712191540067]
Large language models (LLMs) have shown remarkable performance on code generation tasks. Recent works augment the code repair process by integrating modern techniques such as chain-of-thought reasoning or distillation. We investigate the benefits of distilling code repair for both high and low resource languages.
arXiv Detail & Related papers (2024-06-21T05:05:39Z)
How Far Can We Go with Practical Function-Level Program Repair? [11.71750828464698]
This paper investigates the effect of few-shot learning mechanism and the auxiliary repair-relevant information on function-level APR. We propose an LLM-based function-level APR technique, namely SRepair, which adopts a dual-LLM framework to leverage the power of the auxiliary repair-relevant information.
arXiv Detail & Related papers (2024-04-19T12:14:09Z)
ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs [23.419180504723546]
ContrastRepair is a novel APR approach that augments conversation-driven APR by providing contrastive test pairs. We evaluate ContrastRepair on multiple benchmark datasets, including Defects4j, QuixBugs, and HumanEval-Java.
arXiv Detail & Related papers (2024-03-04T12:15:28Z)
A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back. Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair. This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model [15.885824575879763]
Automatic program repair (APR) techniques have the potential to reduce manual efforts in uncovering and repairing program defects during the code review (CR) process. However, the limited accuracy and considerable time costs associated with existing APR approaches hinder their adoption in industrial practice. Recent advancements in Large Language Models (LLMs) have enhanced their ability to comprehend natural and programming languages, enabling them to generate patches based on review comments.
arXiv Detail & Related papers (2023-12-29T06:12:15Z)
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) RAP-Gen explicitly leveraging relevant fix patterns retrieved from a list of previous bug-fix pairs. We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks. A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output. This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z)
Is Self-Repair a Silver Bullet for Code Generation? [68.02601393906083]
Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance. We analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS.
arXiv Detail & Related papers (2023-06-16T15:13:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.