Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing
- URL: http://arxiv.org/abs/2402.11892v2
- Date: Wed, 13 Nov 2024 06:54:05 GMT
- Title: Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing
- Authors: Thanh Le-Cong, Dat Nguyen, Bach Le, Toby Murray
- Abstract summary: We first examine the naturalness of semantic-preserving transformations through a two-stage human study.
Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations.
- Score: 2.763736939516234
- Abstract: In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases in NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.
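To make the described pipeline concrete, the following is a minimal sketch under stated assumptions: a prompt-based naturalness rating followed by a comparison of a repair model's behavior on original versus transformed bugs. The prompt wording, the 1-to-5 scale, the naturalness threshold, the bug-dictionary keys, and the `query_llm`, `repair`, and `is_correct` callables are illustrative placeholders, not the authors' released tooling or exact metric.

```python
# Hypothetical sketch (not the paper's implementation): filter semantic-preserving
# transformations by an LLM naturalness rating, then compare an NPR model's
# patches on original vs. transformed versions of the same bugs.
import re
from typing import Callable, Dict, List

PROMPT = """You are a senior software developer. The two programs below are
semantically equivalent. Rate how natural the TRANSFORMED version is, i.e. how
likely a real developer would write it, from 1 (very unnatural) to 5 (very
natural). Reply with a single integer.

ORIGINAL:
{original}

TRANSFORMED:
{transformed}
"""

def naturalness_score(original: str, transformed: str,
                      query_llm: Callable[[str], str]) -> int:
    """Ask a chat model to rate a semantic-preserving transformation (1-5)."""
    reply = query_llm(PROMPT.format(original=original, transformed=transformed))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparsable rating: {reply!r}")
    return int(match.group())

def robustness_report(bugs: List[Dict], repair: Callable[[str], str],
                      is_correct: Callable[[str, Dict], bool],
                      query_llm: Callable[[str], str],
                      min_naturalness: int = 4) -> Dict[str, float]:
    """Compare correct-patch rates on original vs. naturally transformed bugs,
    keeping only transformations the LLM deems natural. Each bug dict is
    assumed to hold 'original' and 'transformed' source plus whatever
    `is_correct` needs (e.g. its test suite)."""
    kept = correct_orig = correct_trans = changed = 0
    for bug in bugs:
        if naturalness_score(bug["original"], bug["transformed"], query_llm) < min_naturalness:
            continue  # drop unnatural transformations from the benchmark
        kept += 1
        patch_o = repair(bug["original"])
        patch_t = repair(bug["transformed"])
        correct_orig += is_correct(patch_o, bug)
        correct_trans += is_correct(patch_t, bug)
        changed += (patch_o != patch_t)  # prediction change under transformation
    return {
        "kept_bugs": kept,
        "correct_rate_original": correct_orig / kept if kept else 0.0,
        "correct_rate_transformed": correct_trans / kept if kept else 0.0,
        "prediction_change_rate": changed / kept if kept else 0.0,
    }
```

The filtering step reflects the paper's observation that unnatural transformations should not count against a model, while the report exposes both the prediction-change rate and the drop in correct-patch rate between the original and transformed datasets.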
Related papers
- Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm [12.201705893125775]
We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit.
Applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy.
We create a benchmark to evaluate estimator accuracy using synthetic outcomes.
arXiv Detail & Related papers (2024-09-06T15:44:45Z) - Just rotate it! Uncertainty estimation in closed-source models via multiple queries [3.8121150313479655]
We propose a simple and effective method to estimate the uncertainty of closed-source deep neural network image classification models.
We demonstrate significant improvements in the calibration of uncertainty estimates compared to the naive baseline of assigning 100% confidence to all predictions.
arXiv Detail & Related papers (2024-05-22T17:45:38Z) - Topology-preserving Adversarial Training for Alleviating Natural Accuracy Degradation [27.11004064848789]
Adversarial training suffers from the problem of natural accuracy degradation.
We propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem.
We show TRAIN achieves up to 8.86% improvement in natural accuracy and 6.33% improvement in robust accuracy.
arXiv Detail & Related papers (2023-11-29T13:05:06Z) - Effective Restoration of Source Knowledge in Continual Test Time Adaptation [44.17577480511772]
This paper introduces an unsupervised domain change detection method that is capable of identifying domain shifts in dynamic environments.
By restoring knowledge from the source model, it corrects the negative effects of gradually deteriorating model parameters.
We perform extensive experiments on benchmark datasets to demonstrate the superior performance of our method compared to state-of-the-art adaptation methods.
arXiv Detail & Related papers (2023-11-08T19:21:48Z) - Understanding Robust Overfitting from the Feature Generalization Perspective [61.770805867606796]
Adversarial training (AT) constructs robust neural networks by incorporating adversarial perturbations into natural data.
It is plagued by the issue of robust overfitting (RO), which severely damages the model's robustness.
In this paper, we investigate RO from a novel feature generalization perspective.
arXiv Detail & Related papers (2023-10-01T07:57:03Z) - Improved Factorized Neural Transducer Model For text-only Domain Adaptation [14.65352101664147]
Adapting End-to-End ASR models to out-of-domain datasets with text data is challenging.
Factorized neural Transducer (FNT) aims to address this issue by introducing a separate vocabulary decoder to predict the vocabulary.
We present the improved factorized neural Transducer (IFNT) model structure designed to comprehensively integrate acoustic and language information.
arXiv Detail & Related papers (2023-09-18T07:02:04Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further refine the robustness metric: a model is judged robust only if its performance is consistently accurate over the whole clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Can Transformers be Strong Treatment Effect Estimators? [86.32484218657166]
We develop a general framework based on the Transformer architecture to address a variety of treatment effect estimation problems.
Our methods apply to discrete, continuous, structured, and dosage-associated treatments.
Our experiments with Transformers as Treatment Effect Estimators (TransTEE) demonstrate that these inductive biases are also effective on the estimation problems and datasets that arise in causal-effect research.
arXiv Detail & Related papers (2022-02-02T23:56:42Z) - Double Perturbation: On the Robustness of Robustness and Counterfactual
Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.