Is this Change the Answer to that Problem? Correlating Descriptions of
Bug and Code Changes for Evaluating Patch Correctness
- URL: http://arxiv.org/abs/2208.04125v1
- Date: Mon, 8 Aug 2022 13:32:58 GMT
- Title: Is this Change the Answer to that Problem? Correlating Descriptions of
Bug and Code Changes for Evaluating Patch Correctness
- Authors: Haoye Tian, Xunzhu Tang, Andrew Habib, Shangwen Wang, Kui Liu, Xin
Xia, Jacques Klein, Tegawendé F. Bissyandé
- Abstract summary: We turn the patch correctness assessment into a Question Answering problem.
We consider as inputs the bug reports as well as the natural language description of the generated patches.
Experiments show that Quatrain can achieve an AUC of 0.886 on predicting patch correctness.
- Score: 8.606215760860362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a novel perspective to the problem of patch
correctness assessment: a correct patch implements changes that "answer" to a
problem posed by buggy behaviour. Concretely, we turn the patch correctness
assessment into a Question Answering problem. To tackle this problem, our
intuition is that natural language processing can provide the necessary
representations and models for assessing the semantic correlation between a bug
(question) and a patch (answer). Specifically, we consider as inputs the bug
reports as well as the natural language description of the generated patches.
Our approach, Quatrain, first uses state-of-the-art commit message generation
models to produce the relevant inputs associated with each generated patch.
Then we leverage a neural network architecture to learn the semantic
correlation between bug reports and commit messages. Experiments on a large
dataset of 9135 patches generated for three bug datasets (Defects4J, Bugs.jar
and Bears) show that Quatrain achieves an AUC of 0.886 in predicting patch
correctness, recalling 93% of correct patches while filtering out 62% of
incorrect patches. Our experimental results further demonstrate the influence
of input quality on prediction performance. We also perform experiments to
highlight that the model indeed learns the relationship between bug reports and
code change descriptions for the prediction. Finally, we compare against prior
work and discuss the benefits of our approach.
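To make the pipeline concrete, here is a minimal sketch of its second stage: a small classifier that scores how well a generated patch description "answers" a bug report. This is a toy PyTorch model with illustrative hyper-parameters, not the Quatrain architecture itself; tokenisation is omitted and the commit-message-generation stage is only stubbed with random token ids.

```python
# Sketch only: a bug-report / patch-description pair classifier.
# All sizes and names are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class BugPatchCorrelator(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Separate encoders for the bug report ("question") and the
        # generated patch description ("answer").
        self.bug_enc = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.msg_enc = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, bug_ids: torch.Tensor, msg_ids: torch.Tensor) -> torch.Tensor:
        _, h_bug = self.bug_enc(self.embed(bug_ids))        # (2, B, hidden)
        _, h_msg = self.msg_enc(self.embed(msg_ids))
        bug_vec = torch.cat([h_bug[0], h_bug[1]], dim=-1)   # (B, 2*hidden)
        msg_vec = torch.cat([h_msg[0], h_msg[1]], dim=-1)
        logits = self.classifier(torch.cat([bug_vec, msg_vec], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)            # P(patch answers the bug)

# Toy usage with random token ids; a real pipeline would first run a
# commit-message-generation model over the candidate patch.
model = BugPatchCorrelator(vocab_size=5000)
bug = torch.randint(1, 5000, (2, 40))   # two bug reports, 40 tokens each
msg = torch.randint(1, 5000, (2, 20))   # two generated patch descriptions
print(model(bug, msg))                  # correctness scores in [0, 1]
```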
Related papers
- Learning to Represent Patches [7.073203009308308]
We introduce a novel method, Patcherizer, to bridge the gap between deep learning for patch representation and semantic intent.
Patcherizer employs graph convolutional neural networks for structural intention graph representation and transformers for intention sequence representation.
Our experiments demonstrate the representation's efficacy across all tasks, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2023-08-31T09:34:38Z)
- Invalidator: Automated Patch Correctness Assessment via Semantic and Syntactic Reasoning [6.269370220586248]
In this paper, we propose a novel technique to automatically assess the correctness of APR-generated patches via semantic and syntactic reasoning.
We have conducted experiments on a dataset of 885 patches generated for real-world programs in Defects4J.
Experimental results show that INVALIDATOR correctly classifies 79% of overfitting patches, detecting 23% more overfitting patches than the best baseline.
arXiv Detail & Related papers (2023-01-03T14:16:32Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
- Fixing Model Bugs with Natural Language Patches [38.67529353406759]
We explore natural language patches that allow developers to provide corrective feedback at the right level of abstraction.
We show that with a small amount of synthetic data, we can teach models to effectively use real patches on real data.
We also show that finetuning on as many as 100 labeled examples may be needed to match the performance of a small set of language patches.
arXiv Detail & Related papers (2022-11-07T05:49:19Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
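As an illustration of the kind of model this entry describes, the sketch below combines a bidirectional recurrent encoder with attention over stack-trace frames and scores a (stack trace, candidate developer) pair for ranking. All names, sizes, and the scoring scheme are assumptions, not the DapStep implementation.

```python
# Illustrative sketch: attention-weighted BiGRU over stack-trace frames.
import torch
import torch.nn as nn

class TraceAssigneeScorer(nn.Module):
    def __init__(self, frame_vocab: int, n_devs: int, dim: int = 64):
        super().__init__()
        self.frame_emb = nn.Embedding(frame_vocab, dim, padding_idx=0)
        self.dev_emb = nn.Embedding(n_devs, 2 * dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, frames: torch.Tensor, dev: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.frame_emb(frames))          # (B, T, 2*dim)
        w = torch.softmax(self.attn(h), dim=1)           # attention over frames
        trace_vec = (w * h).sum(dim=1)                   # (B, 2*dim)
        return (trace_vec * self.dev_emb(dev)).sum(-1)   # ranking score per pair

# Toy usage: 4 stack traces of 30 frame ids each, 4 candidate developer ids.
scores = TraceAssigneeScorer(frame_vocab=1000, n_devs=50)(
    torch.randint(1, 1000, (4, 30)), torch.randint(0, 50, (4,)))
print(scores)
```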
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- Checking Patch Behaviour against Test Specification [4.723400023753107]
We propose a hypothesis on how the link between the patch behaviour and failing test specifications can be drawn.
We then propose BATS, an unsupervised learning-based system to predict patch correctness.
arXiv Detail & Related papers (2021-07-28T11:39:06Z)
- Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories.
We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch.
We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
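Temperature scaling is one standard post-hoc calibration method in the family this entry studies; the sketch below is illustrative and not tied to the paper's exact setup.

```python
# Sketch: learn a single temperature T on held-out (logits, labels) so that
# softmax(logits / T) is better calibrated. Hyper-parameters are assumptions.
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, steps: int = 200) -> float:
    log_t = torch.zeros(1, requires_grad=True)       # optimise log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return float(log_t.exp())

# Toy usage: over-confident logits for 3 classes.
logits = torch.randn(100, 3) * 5
labels = torch.randint(0, 3, (100,))
T = fit_temperature(logits, labels)
calibrated_probs = torch.softmax(logits / T, dim=-1)
```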
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We propose prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
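A minimal PyTorch sketch of the idea, assuming a model with standard BatchNorm layers (an illustration, not the paper's code): keep the network in eval mode but switch BatchNorm layers back to per-batch statistics so each test batch is normalized with its own mean and variance.

```python
import torch
import torch.nn as nn

def use_prediction_time_bn(model: nn.Module) -> nn.Module:
    """Normalize test batches with their own statistics instead of the
    running averages collected during training."""
    model.eval()  # disable dropout and other train-only behaviour
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()  # use per-batch statistics at inference
            # Caveat: this also updates the stored running averages as
            # batches pass through; acceptable for a one-off evaluation.
    return model

# Toy usage under covariate shift: statistics now come from the incoming batch.
net = use_prediction_time_bn(nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()))
with torch.no_grad():
    out = net(torch.randn(16, 3, 32, 32))
```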
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)