Invalidator: Automated Patch Correctness Assessment via Semantic and
Syntactic Reasoning
- URL: http://arxiv.org/abs/2301.01113v1
- Date: Tue, 3 Jan 2023 14:16:32 GMT
- Title: Invalidator: Automated Patch Correctness Assessment via Semantic and
Syntactic Reasoning
- Authors: Thanh Le-Cong, Duc-Minh Luong, Xuan Bach D. Le, David Lo, Nhat-Hoa
Tran, Bui Quang-Huy and Quyet-Thang Huynh
- Abstract summary: In this paper, we propose a novel technique to automatically assess the correctness of APR-generated patches via semantic and syntactic reasoning.
We have conducted experiments on a dataset of 885 patches generated on real-world programs in Defects4J.
Experiment results show that INVALIDATOR correctly classified 79% of overfitting patches, detecting 23% more overfitting patches than the best baseline.
- Score: 6.269370220586248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel technique, namely INVALIDATOR, to
automatically assess the correctness of APR-generated patches via semantic and
syntactic reasoning. INVALIDATOR reasons about program semantics via program
invariants, while it also captures program syntax via language semantics learned
from a large code corpus using a pre-trained language model. Given a buggy
program and the developer-patched program, INVALIDATOR infers likely invariants
on both programs. Then, INVALIDATOR determines that an APR-generated patch
overfits if it: (1) violates correct specifications or (2) maintains erroneous
behaviors of the original buggy program. In case our approach fails to
determine whether a patch is overfitting based on invariants, INVALIDATOR utilizes a
model trained on labeled patches to assess patch correctness based on program
syntax. The benefit of INVALIDATOR is three-fold. First, INVALIDATOR is able to
leverage both semantic and syntactic reasoning to enhance its discriminant
capability. Second, INVALIDATOR does not require new test cases to be generated
but instead only relies on the current test suite and uses invariant inference
to generalize the behaviors of a program. Third, INVALIDATOR is fully
automated. We have conducted our experiments on a dataset of 885 patches
generated on real-world programs in Defects4J. Experiment results show that
INVALIDATOR correctly classified 79% of overfitting patches, detecting 23%
more overfitting patches than the best baseline. INVALIDATOR also
substantially outperforms the best baselines by 14% and 19% in terms of
Accuracy and F-Measure, respectively.
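A minimal sketch of this two-stage decision rule, in Python, is given below. The invariant sets, the feature vector, and the scikit-learn-style classifier interface are illustrative assumptions rather than the paper's actual implementation; in practice, the likely invariants would come from dynamic invariant inference (e.g., a Daikon-style tool) over the existing test suite.

```python
# Sketch of INVALIDATOR's two-stage decision rule (not the authors' code):
# the invariant sets and classifier interface are hypothetical placeholders.

def assess_patch(patched_invariants: set,
                 correct_specs: set,
                 error_behaviors: set,
                 syntax_classifier=None,
                 patch_features=None) -> bool:
    """Return True if the APR-generated patch is judged overfitting."""
    # (1) Semantic check: the patch violates correct specifications if some
    #     invariant of the developer-patched program no longer holds on it.
    if not correct_specs.issubset(patched_invariants):
        return True

    # (2) Semantic check: the patch maintains erroneous behaviors if
    #     error-revealing invariants of the buggy program still hold on it.
    if error_behaviors & patched_invariants:
        return True

    # (3) Fallback: when invariants are inconclusive, defer to a classifier
    #     trained on labeled patches over syntactic (code-model) features
    #     (assumed here to expose a scikit-learn-style predict() method).
    if syntax_classifier is not None and patch_features is not None:
        return bool(syntax_classifier.predict([patch_features])[0])
    return False
```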
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt and their generative capability, LLMs can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- PatchZero: Zero-Shot Automatic Patch Correctness Assessment [13.19425284402493]
We propose PatchZero, a patch correctness assessment approach that adopts a large language model for code.
PatchZero prioritizes labeled patches from existing APR tools that exhibit semantic similarity to those generated by new APR tools.
Our experimental results showed that PatchZero can achieve an accuracy of 84.4% and an F1-score of 86.5% on average.
arXiv Detail & Related papers (2023-03-01T03:12:11Z)
- Fine-Tuning Deteriorates General Textual Out-of-Distribution Detection by Distorting Task-Agnostic Features [14.325845491628087]
Detecting out-of-distribution (OOD) inputs is crucial for the safe deployment of natural language processing (NLP) models.
We take the first step to evaluate the mainstream textual OOD detection methods for detecting semantic and non-semantic shifts.
We present a simple yet effective general OOD score named GNOME that integrates the confidence scores derived from the task-agnostic and task-specific representations.
arXiv Detail & Related papers (2023-01-30T08:01:13Z)
- APPT: Boosting Automated Patch Correctness Prediction via Fine-tuning Pre-trained Models [15.179895484968476]
We propose APPT, a pre-trained model-based automated patch correctness assessment technique that uses both pre-training and fine-tuning.
We conduct an experiment on 1,183 Defects4J patches and the experimental results show that APPT achieves prediction accuracy of 79.7% and recall of 83.2%.
arXiv Detail & Related papers (2023-01-29T14:28:26Z)
- Checking Patch Behaviour against Test Specification [4.723400023753107]
We propose a hypothesis on how the link between the patch behaviour and failing test specifications can be drawn.
We then propose BATS, an unsupervised learning-based system to predict patch correctness.
arXiv Detail & Related papers (2021-07-28T11:39:06Z)
- Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories.
We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch.
We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- Adversarial Transfer Learning for Punctuation Restoration [58.2201356693101]
Adversarial multi-task learning is introduced to learn task invariant knowledge for punctuation prediction.
Experiments are conducted on IWSLT2011 datasets.
arXiv Detail & Related papers (2020-04-01T06:19:56Z)
- Rectifying Pseudo Label Learning via Uncertainty Estimation for Domain Adaptive Semantic Segmentation [49.295165476818866]
This paper focuses on the unsupervised domain adaptation of transferring the knowledge from the source domain to the target domain in the context of semantic segmentation.
Existing approaches usually regard the pseudo label as the ground truth to fully exploit the unlabeled target-domain data.
This paper proposes to explicitly estimate the prediction uncertainty during training to rectify the pseudo label learning.
arXiv Detail & Related papers (2020-03-08T12:37:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.