Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing
- URL: http://arxiv.org/abs/2402.11892v2
- Date: Wed, 13 Nov 2024 06:54:05 GMT
- Title: Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing
- Authors: Thanh Le-Cong, Dat Nguyen, Bach Le, Toby Murray
- Abstract summary: We first examine the naturalness of semantic-preserving transformations through a two-stage human study.
Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations.
- Score: 2.763736939516234
- Abstract: In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases in NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.
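To make the described pipeline concrete, the following is a minimal sketch under stated assumptions: a prompt-based naturalness rating followed by a comparison of a repair model's behavior on original versus transformed bugs. The prompt wording, the 1-to-5 scale, the naturalness threshold, the bug-dictionary keys, and the `query_llm`, `repair`, and `is_correct` callables are illustrative placeholders, not the authors' released tooling or exact metric.

```python
# Hypothetical sketch (not the paper's implementation): filter semantic-preserving
# transformations by an LLM naturalness rating, then compare an NPR model's
# patches on original vs. transformed versions of the same bugs.
import re
from typing import Callable, Dict, List

PROMPT = """You are a senior software developer. The two programs below are
semantically equivalent. Rate how natural the TRANSFORMED version is, i.e. how
likely a real developer would write it, from 1 (very unnatural) to 5 (very
natural). Reply with a single integer.

ORIGINAL:
{original}

TRANSFORMED:
{transformed}
"""

def naturalness_score(original: str, transformed: str,
                      query_llm: Callable[[str], str]) -> int:
    """Ask a chat model to rate a semantic-preserving transformation (1-5)."""
    reply = query_llm(PROMPT.format(original=original, transformed=transformed))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparsable rating: {reply!r}")
    return int(match.group())

def robustness_report(bugs: List[Dict], repair: Callable[[str], str],
                      is_correct: Callable[[str, Dict], bool],
                      query_llm: Callable[[str], str],
                      min_naturalness: int = 4) -> Dict[str, float]:
    """Compare correct-patch rates on original vs. naturally transformed bugs,
    keeping only transformations the LLM deems natural. Each bug dict is
    assumed to hold 'original' and 'transformed' source plus whatever
    `is_correct` needs (e.g. its test suite)."""
    kept = correct_orig = correct_trans = changed = 0
    for bug in bugs:
        if naturalness_score(bug["original"], bug["transformed"], query_llm) < min_naturalness:
            continue  # drop unnatural transformations from the benchmark
        kept += 1
        patch_o = repair(bug["original"])
        patch_t = repair(bug["transformed"])
        correct_orig += is_correct(patch_o, bug)
        correct_trans += is_correct(patch_t, bug)
        changed += (patch_o != patch_t)  # prediction change under transformation
    return {
        "kept_bugs": kept,
        "correct_rate_original": correct_orig / kept if kept else 0.0,
        "correct_rate_transformed": correct_trans / kept if kept else 0.0,
        "prediction_change_rate": changed / kept if kept else 0.0,
    }
```

The filtering step reflects the paper's observation that unnatural transformations should not count against a model, while the report exposes both the prediction-change rate and the drop in correct-patch rate between the original and transformed datasets.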
Related papers
- Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm [12.201705893125775]
We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit.
Applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy.
We create a benchmark to evaluate estimator accuracy using synthetic outcomes.
arXiv Detail & Related papers (2024-09-06T15:44:45Z) - Just rotate it! Uncertainty estimation in closed-source models via multiple queries [3.8121150313479655]
We propose a simple and effective method to estimate the uncertainty of closed-source deep neural network image classification models.
We demonstrate significant improvements in the calibration of uncertainty estimates compared to the naive baseline of assigning 100% confidence to all predictions.
arXiv Detail & Related papers (2024-05-22T17:45:38Z) - Topology-preserving Adversarial Training for Alleviating Natural Accuracy Degradation [27.11004064848789]
Adversarial training suffers from the problem of natural accuracy degradation.
We propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem.
We show TRAIN achieves up to 8.86% improvement in natural accuracy and 6.33% improvement in robust accuracy.
arXiv Detail & Related papers (2023-11-29T13:05:06Z) - Effective Restoration of Source Knowledge in Continual Test Time Adaptation [44.17577480511772]
This paper introduces an unsupervised domain change detection method that is capable of identifying domain shifts in dynamic environments.
By restoring knowledge from the source model, it corrects the negative effects of gradually deteriorating model parameters.
We perform extensive experiments on benchmark datasets to demonstrate the superior performance of our method compared to state-of-the-art adaptation methods.
arXiv Detail & Related papers (2023-11-08T19:21:48Z) - Understanding Robust Overfitting from the Feature Generalization Perspective [61.770805867606796]
Adversarial training (AT) constructs robust neural networks by incorporating adversarial perturbations into natural data.
It is plagued by the issue of robust overfitting (RO), which severely damages the model's robustness.
In this paper, we investigate RO from a novel feature generalization perspective.
arXiv Detail & Related papers (2023-10-01T07:57:03Z) - Improved Factorized Neural Transducer Model For text-only Domain Adaptation [14.65352101664147]
Adapting End-to-End ASR models to out-of-domain datasets with text data is challenging.
Factorized neural Transducer (FNT) aims to address this issue by introducing a separate vocabulary decoder to predict the vocabulary.
We present the improved factorized neural Transducer (IFNT) model structure designed to comprehensively integrate acoustic and language information.
arXiv Detail & Related papers (2023-09-18T07:02:04Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further refine the robustness metric: a model is judged robust only if its performance is consistently accurate over the whole clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Can Transformers be Strong Treatment Effect Estimators? [86.32484218657166]
We develop a general framework based on the Transformer architecture to address a variety of treatment effect estimation problems.
Our methods apply to discrete, continuous, structured, and dosage-associated treatments.
Our experiments with Transformers as Treatment Effect Estimators (TransTEE) demonstrate that these inductive biases are also effective on the estimation problems and datasets that arise in causal-effect research.
arXiv Detail & Related papers (2022-02-02T23:56:42Z) - Double Perturbation: On the Robustness of Robustness and Counterfactual
Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.