ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs
- URL: http://arxiv.org/abs/2403.01971v2
- Date: Thu, 7 Mar 2024 05:33:36 GMT
- Title: ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs
- Authors: Jiaolong Kong, Mingfei Cheng, Xiaofei Xie, Shangqing Liu, Xiaoning Du, Qi Guo
- Abstract summary: ContrastRepair is a novel approach that augments conversation-driven APR by providing LLMs with contrastive test pairs.
We evaluate ContrastRepair on multiple benchmark datasets, including Defects4J, QuixBugs, and HumanEval-Java.
- Score: 23.419180504723546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Program Repair (APR) aims to automatically generate patches for
rectifying software bugs. Recent strides in Large Language Models (LLMs), such
as ChatGPT, have yielded encouraging outcomes in APR, especially within the
conversation-driven APR framework. Nevertheless, the efficacy of
conversation-driven APR is contingent on the quality of the feedback
information. In this paper, we propose ContrastRepair, a novel
conversation-based APR approach that augments conversation-driven APR by
providing LLMs with contrastive test pairs. A test pair consists of a failing
test and a passing test, which offer contrastive feedback to the LLM. Our key
insight is to minimize the difference between the generated passing test and
the given failing test, which can better isolate the root causes of bugs. By
providing informative and specific feedback, ContrastRepair enables the LLM to
produce effective bug fixes. The implementation of ContrastRepair is based on
the state-of-the-art LLM, ChatGPT, and it iteratively interacts with ChatGPT
until plausible patches are generated. We evaluate ContrastRepair on multiple
benchmark datasets, including Defects4J, QuixBugs, and HumanEval-Java. The
results demonstrate that ContrastRepair significantly outperforms existing
methods, achieving a new state of the art in program repair. For instance, across Defects4J 1.2 and 2.0, ContrastRepair correctly repairs 143 out of 337 bugs, while the best-performing baseline fixes 124.
Related papers
- MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.
We introduce a universal and training-free framework, MQM-APE, to enhance the quality of error annotations predicted by LLM evaluators.
arXiv Detail & Related papers (2024-09-22T06:43:40Z)
- Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models [48.42142115255159]
We release BlockWorld-Repairs: a dataset of multi-modal Third Position Repair (TPR) sequences in an instruction-following manipulation task.
We evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs.
Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings.
arXiv Detail & Related papers (2024-09-21T21:06:25Z)
- ThinkRepair: Self-Directed Automated Program Repair [11.598008952093487]
Large language models (LLMs) instructed via prompt engineering have attracted much attention for their powerful ability to address many kinds of tasks, including bug fixing.
We propose ThinkRepair, a self-directed LLM-based automated program repair approach with two main phases: a collection phase and a fixing phase.
Evaluations on two widely studied datasets (Defects4J and QuixBugs) comparing ThinkRepair with 12 SOTA APR tools indicate the superiority of ThinkRepair in fixing bugs.
arXiv Detail & Related papers (2024-07-30T15:17:07Z)
- Hybrid Automated Program Repair by Combining Large Language Models and Program Analysis [12.7034916462208]
Automated Program Repair (APR) has garnered significant attention due to its potential to streamline the bug repair process for human developers.
This paper introduces an innovative APR approach called GIANTREPAIR.
GIANTREPAIR first constructs patch skeletons from LLM-generated patches to confine the patch space, and then generates high-quality patches tailored to specific programs.
arXiv Detail & Related papers (2024-06-03T05:05:12Z)
- A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
- A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair [19.123640635549524]
Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of software engineering tasks.
This paper reviews the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives.
ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming the state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% in prediction accuracy, respectively.
arXiv Detail & Related papers (2023-10-13T06:11:47Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages: the TFix benchmark in JavaScript, and the Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT [10.071615423169902]
Automated Program Repair (APR) aims to automatically generate patches for buggy programs.
Recent APR work has been focused on leveraging modern Large Language Models (LLMs) to directly generate patches for APR.
We propose ChatRepair, the first fully automated conversation-driven APR approach.
arXiv Detail & Related papers (2023-04-01T20:57:33Z)
- Conversational Automated Program Repair [10.071615423169902]
We propose a new paradigm for program repair that alternates between patch generation and validation in a conversational manner.
We leverage the long-term context window of Large Pre-Trained Language Models to not only avoid generating previously incorrect patches but also incorporate validation feedback to help the model understand the semantic meaning of the program under test.
arXiv Detail & Related papers (2023-01-30T19:22:36Z)