PatchZero: Zero-Shot Automatic Patch Correctness Assessment
- URL: http://arxiv.org/abs/2303.00202v3
- Date: Fri, 22 Mar 2024 09:09:15 GMT
- Title: PatchZero: Zero-Shot Automatic Patch Correctness Assessment
- Authors: Xin Zhou, Bowen Xu, Kisub Kim, DongGyun Han, Thanh Le-Cong, Junda He, Bach Le, David Lo
- Abstract summary: We propose PatchZero, a patch correctness assessment technique that adopts a large language model for code.
PatchZero prioritizes labeled patches from existing APR tools that exhibit semantic similarity to those generated by new APR tools.
Our experimental results showed that PatchZero can achieve an accuracy of 84.4% and an F1-score of 86.5% on average.
- Score: 13.19425284402493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Program Repair (APR) techniques have shown increasingly promising results in fixing real-world bugs. Despite their effectiveness, APR techniques still face an overfitting problem: a generated patch can be incorrect even though it passes all tests. Manually evaluating the correctness of generated patches that pass all tests is time-consuming. To address this problem, many approaches have been proposed to automatically assess the correctness of patches generated by APR techniques. These approaches are mainly evaluated within the cross-validation setting. However, for patches generated by a new or unseen APR tool, users are implicitly required to manually label a significant portion of these patches in the cross-validation setting before inferring the remaining patches. To mitigate this issue, in this study, we propose PatchZero, a patch correctness assessment technique that adopts a large language model for code. Specifically, for patches generated by a new or unseen APR tool, PatchZero does not need labeled patches from that tool for training; instead, it directly queries the large language model for code to predict the correctness labels without training. In this way, PatchZero can reduce the manual labeling effort required to build a model that automatically assesses the correctness of patches generated by new APR tools. PatchZero prioritizes labeled patches from existing APR tools that exhibit semantic similarity to those generated by new APR tools, enhancing its accuracy on patches from new APR tools. Our experimental results showed that PatchZero can achieve an accuracy of 84.4% and an F1-score of 86.5% on average, even though no labeled patch of the new or unseen APR tool is available. In addition, our proposed technique outperformed the prior state-of-the-art by a large margin.
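As a rough illustration of the mechanism the abstract describes (selecting semantically similar labeled patches as in-context examples before querying a code LLM), consider the minimal Python sketch below. The token-overlap similarity, prompt format, and query_llm stub are simplifying assumptions, not PatchZero's actual implementation.

    # Hedged sketch of few-shot patch-correctness prompting; PatchZero's
    # real similarity metric, prompt template, and model differ.

    def similarity(patch_a: str, patch_b: str) -> float:
        """Toy semantic similarity: Jaccard overlap of code tokens."""
        a, b = set(patch_a.split()), set(patch_b.split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def build_prompt(target: str, labeled_pool: list, k: int = 3) -> str:
        """Pick the k labeled patches most similar to the target as
        in-context examples, then ask for the target's correctness label."""
        examples = sorted(labeled_pool,
                          key=lambda pair: similarity(pair[0], target),
                          reverse=True)[:k]
        parts = ["Decide whether each patch is correct or overfitting.\n"]
        for patch, label in examples:
            parts.append(f"Patch:\n{patch}\nLabel: {label}\n")
        parts.append(f"Patch:\n{target}\nLabel:")
        return "\n".join(parts)

    def query_llm(prompt: str) -> str:
        # Stand-in for a call to a large language model for code.
        return "correct"  # canned response for demonstration only

    pool = [("if (x == null) return;", "correct"),
            ("return 0; // delete check", "overfitting")]
    print(query_llm(build_prompt("if (y == null) return;", pool)))

Note that the labeled pool contains only patches from existing APR tools, never from the new tool under assessment, which is what makes the setting zero-shot with respect to the new tool.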
Related papers
- Ranking Plausible Patches by Historic Feature Frequencies [4.129445293427074]
This paper presents PrevaRank, a technique that ranks plausible patches according to their feature similarity with historic programmer-written fixes for similar bugs.
PrevaRank consistently improved the ranking of correct fixes.
It works robustly with a variety of APR tools and bugs, with negligible overhead.
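The ranking idea could be sketched roughly as follows; the token-level features and frequency-sum scoring are assumptions for illustration, not PrevaRank's actual feature set.

    # Hypothetical sketch: rank plausible patches by how often their
    # features (here, plain tokens) occur in historic programmer fixes.
    from collections import Counter

    historic_fixes = ["if x is None : return", "raise ValueError ( msg )"]
    feature_counts = Counter(tok for fix in historic_fixes
                             for tok in fix.split())

    def score(patch: str) -> int:
        # Sum the historic frequencies of the patch's features.
        return sum(feature_counts[tok] for tok in set(patch.split()))

    plausible = ["if y is None : return", "return 0"]
    print(sorted(plausible, key=score, reverse=True))
    # Patches resembling historic fixes are ranked first for review.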
arXiv Detail & Related papers (2024-07-24T12:58:14Z)
- A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
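A round-trip pipeline might look like the skeleton below; the translate stub stands in for a pre-trained model (no fine-tuning assumed), and the language choices are illustrative.

    # Skeleton of Round-Trip Translation (RTT) for program repair.

    def translate(text: str, source: str, target: str) -> str:
        # Stand-in for a pre-trained code/NL translation model.
        return text  # identity placeholder for demonstration

    def round_trip_repair(buggy_code: str) -> str:
        # Forward pass: render the buggy code in a pivot language.
        pivot = translate(buggy_code, source="java", target="english")
        # Backward pass: regenerate code from the pivot. The model's
        # prior over typical code can smooth out the bug, analogous to
        # fixing grammar by translating a sentence there and back.
        return translate(pivot, source="english", target="java")

    candidate = round_trip_repair("int div(int a,int b){return a+b;}")
    # The candidate patch would then be validated against the test suite.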
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen).
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
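The retrieval-augmented idea can be sketched as below; the lexical-overlap retriever and string-assembling generator are hypothetical stand-ins for RAP-Gen's retriever and CodeT5 generator.

    # Sketch of retrieval-augmented patch generation with placeholder
    # components; a real system would use a trained retriever and model.

    def retrieve(buggy: str, past_pairs: list, k: int = 1) -> list:
        # Toy retriever: rank previous (bug, fix) pairs by token overlap.
        def overlap(pair):
            return len(set(pair[0].split()) & set(buggy.split()))
        return sorted(past_pairs, key=overlap, reverse=True)[:k]

    def generate_patch(buggy: str, examples: list) -> str:
        # A real system feeds the retrieved fix patterns plus the buggy
        # code to a seq2seq model; here we only assemble its input.
        prompt = "".join(f"bug: {b} fix: {f}\n" for b, f in examples)
        return prompt + f"bug: {buggy} fix: <generated patch>"

    past_pairs = [("x = a / b", "x = a / b if b else 0")]
    print(generate_patch("y = p / q", retrieve("y = p / q", past_pairs)))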
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- Target before Shooting: Accurate Anomaly Detection and Localization under One Millisecond via Cascade Patch Retrieval [49.45246833329707]
We re-examine the "matching" nature of Anomaly Detection (AD)
We propose a new AD framework that simultaneously enjoys new records of AD accuracy and dramatically high running speed.
arXiv Detail & Related papers (2023-08-13T11:49:05Z)
- Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition [49.42732949233184]
When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition.
Taking noisy labels as ground-truth in the loss function results in suboptimal performance.
We propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels.
arXiv Detail & Related papers (2023-08-12T12:13:52Z)
- APPT: Boosting Automated Patch Correctness Prediction via Fine-tuning Pre-trained Models [15.179895484968476]
We propose APPT, a pre-trained model-based automated patch correctness assessment technique that uses both pre-training and fine-tuning.
We conduct an experiment on 1,183 Defects4J patches and the experimental results show that APPT achieves prediction accuracy of 79.7% and recall of 83.2%.
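The pre-train-then-fine-tune recipe amounts to binary sequence-pair classification over buggy and patched code. The sketch below assumes a BERT-style encoder via the HuggingFace transformers library; APPT's exact model and input pairing may differ.

    # Hedged sketch: patch correctness as binary sequence-pair
    # classification with a pre-trained encoder (HuggingFace assumed).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "bert-base-uncased"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=2)

    # Encode the buggy and patched methods as a sentence pair.
    inputs = tokenizer("int f(){return a/b;}",         # buggy method
                       "int f(){return b!=0?a/b:0;}",  # patched method
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = logits.argmax(dim=-1).item()  # e.g., 0 = overfitting, 1 = correct
    # Fine-tuning would first minimise cross-entropy on labeled patches.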
arXiv Detail & Related papers (2023-01-29T14:28:26Z)
- Invalidator: Automated Patch Correctness Assessment via Semantic and Syntactic Reasoning [6.269370220586248]
In this paper, we propose a novel technique to automatically assess the correctness of APR-generated patches via semantic and syntactic reasoning.
We have conducted experiments on a dataset of 885 patches generated on real-world programs in Defects4J.
Experimental results show that INVALIDATOR correctly classified 79% of overfitting patches, detecting 23% more overfitting patches than the best baseline.
arXiv Detail & Related papers (2023-01-03T14:16:32Z)
- Test-based Patch Clustering for Automatically-Generated Patches Assessment [21.051652050359852]
Overfitting happens when a patch passes the test suite without revealing any error, yet it either does not actually fix the underlying bug or introduces a new defect that the test suite does not cover.
Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch.
We introduce a novel lightweight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior.
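Clustering by dynamic behavior can be illustrated by grouping patches whose per-test outcomes coincide; the outcome-vector key below is a simplification of xTestCluster's actual clustering.

    # Toy illustration: cluster patches by the vector of test outcomes
    # they produce (a stand-in for richer dynamic-behavior features).
    from collections import defaultdict

    def run_tests(patch_id: str) -> tuple:
        # Stand-in for executing the test suite on the patched program.
        outcomes = {"p1": (True, True, False), "p2": (True, True, False),
                    "p3": (True, False, True)}
        return outcomes[patch_id]

    clusters = defaultdict(list)
    for patch in ["p1", "p2", "p3"]:
        clusters[run_tests(patch)].append(patch)

    for behavior, members in clusters.items():
        print(behavior, "->", members)
    # Reviewers inspect one representative per behavioral cluster
    # instead of every plausible patch.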
arXiv Detail & Related papers (2022-07-22T13:39:27Z)
- Checking Patch Behaviour against Test Specification [4.723400023753107]
We propose a hypothesis on how the link between patch behaviour and failing test specifications can be drawn.
We then propose BATS, an unsupervised learning-based system to predict patch correctness.
arXiv Detail & Related papers (2021-07-28T11:39:06Z)
- Coping with Label Shift via Distributionally Robust Optimisation [72.80971421083937]
We propose a model that minimises an objective based on distributionally robust optimisation (DRO).
We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective.
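One common form of such a DRO objective under label shift (written generically here; the paper's exact uncertainty set may differ) adversarially reweights class-conditional risks:

    \min_{\theta} \; \max_{w \in \Delta_K,\; D(w \,\|\, \hat{p}) \le r} \;
        \sum_{k=1}^{K} w_k \, \mathbb{E}\left[\ell(f_\theta(x), y) \mid y = k\right]

where \hat{p} is the empirical training label distribution, \Delta_K the probability simplex, and r a bound on the divergence D; the inner maximisation over w is what the proximal mirror ascent steps handle.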
arXiv Detail & Related papers (2020-10-23T08:33:04Z)
- Learning to Purify Noisy Labels via Meta Soft Label Corrector [49.92310583232323]
Recent deep neural networks (DNNs) can easily overfit to biased training data with noisy labels.
A label correction strategy is commonly used to alleviate this issue.
We propose a meta-learning model which could estimate soft labels through meta-gradient descent step.
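The meta-gradient step can be read as a generic bilevel problem (a schematic formulation, not necessarily the paper's exact objective):

    \min_{\phi} \; \mathcal{L}_{\text{meta}}\big(\theta^{*}(\phi)\big)
    \quad \text{s.t.} \quad
    \theta^{*}(\phi) = \arg\min_{\theta} \sum_i
        \ell\big(f_\theta(x_i),\, g_\phi(\tilde{y}_i)\big)

where g_\phi maps noisy labels \tilde{y} to soft labels and \mathcal{L}_{\text{meta}} is the loss on a small clean meta set; \phi is updated by differentiating through one (or a few) inner gradient steps on \theta.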
arXiv Detail & Related papers (2020-08-03T03:25:17Z)