Towards More Realistic Evaluation for Neural Test Oracle Generation
- URL: http://arxiv.org/abs/2305.17047v1
- Date: Fri, 26 May 2023 15:56:57 GMT
- Title: Towards More Realistic Evaluation for Neural Test Oracle Generation
- Authors: Zhongxin Liu, Kui Liu, Xin Xia, Xiaohu Yang
- Abstract summary: Unit tests can help guard and improve software quality but require a substantial amount of time and effort to write and maintain.
Recent studies proposed to leverage neural models to generate test oracles, i.e., neural test oracle generation (NTOG).
However, inappropriate settings in existing evaluation methods could mislead the understanding of existing NTOG approaches' performance.
- Score: 11.005450298374285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective unit tests can help guard and improve software quality but require
a substantial amount of time and effort to write and maintain. A unit test
consists of a test prefix and a test oracle. Synthesizing test oracles,
especially functional oracles, is a well-known challenging problem. Recent
studies proposed to leverage neural models to generate test oracles, i.e.,
neural test oracle generation (NTOG), and obtained promising results. However,
after a systematic inspection, we find there are some inappropriate settings in
existing evaluation methods for NTOG. These settings could mislead the
understanding of existing NTOG approaches' performance. We summarize them as 1)
generating test prefixes from bug-fixed program versions, 2) evaluating with an
unrealistic metric, and 3) lacking a straightforward baseline. In this paper,
we first investigate the impacts of these settings on evaluating and
understanding the performance of NTOG approaches. We find that 1)
unrealistically generating test prefixes from bug-fixed program versions
inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by
61.8%, 2) FPR (False Positive Rate) is not a realistic evaluation metric and
the Precision of TOGA is only 0.38%, and 3) a straightforward baseline
NoException, which simply expects no exception should be raised, can find 61%
of the bugs found by TOGA with twice the Precision. Furthermore, we introduce
an additional ranking step to existing evaluation methods and propose an
evaluation metric named Found@K to better measure the cost-effectiveness of
NTOG approaches. We propose a novel unsupervised ranking method to instantiate
this ranking step, significantly improving the cost-effectiveness of TOGA.
Eventually, we propose a more realistic evaluation method TEval+ for NTOG and
summarize seven rules of thumb to boost NTOG approaches into their practical
usages.
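The NoException baseline and the Found@K metric described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the function names and the pass/fail convention are hypothetical.

```python
# Illustrative sketch of two ideas from the abstract: the NoException
# baseline oracle and the Found@K metric. Names are hypothetical.

def no_exception_oracle(run_test_prefix):
    """Baseline oracle: a test passes iff executing its prefix raises nothing."""
    try:
        run_test_prefix()
        return "pass"
    except Exception:
        return "fail"  # the prefix raised, so flag a potential bug


def found_at_k(ranked_tests, is_real_bug, k):
    """Found@K: how many real bugs appear among the top-K ranked
    bug-revealing test cases (a proxy for inspection cost-effectiveness)."""
    return sum(1 for test in ranked_tests[:k] if is_real_bug(test))
```

For example, if a ranking method places 5 true bugs among the first 10 of its flagged tests, Found@10 is 5; a better ranking step raises Found@K for small K, which is what the proposed unsupervised ranking method aims at.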
Related papers
- Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection [1.4530711901349282]
Test-Time Adaptation (TTA) has emerged as a promising strategy for tackling the problem of machine learning model robustness under distribution shifts.
We evaluate existing TTA methods using surrogate-based hyperparameter-selection strategies to obtain a more realistic evaluation of their performance.
arXiv Detail & Related papers (2024-07-19T11:58:30Z)
- Test-Time Personalization with Meta Prompt for Gaze Estimation [23.01057994927244]
We take inspiration from the recent advances in Natural Language Processing (NLP) by updating a negligible number of parameters, "prompts", at the test time.
We propose to meta-learn the prompt to ensure that its updates align with the goal.
Our experiments show that the meta-learned prompt can be effectively adapted even with a simple symmetry loss.
arXiv Detail & Related papers (2024-01-03T07:02:35Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation [15.728090002818963]
Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels.
In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels.
arXiv Detail & Related papers (2023-08-01T05:01:05Z)
- Neural-Based Test Oracle Generation: A Large-scale Evaluation and Lessons Learned [17.43060451305942]
TOGA is a recently developed neural-based method for automatic test oracle generation.
It misclassifies the type of oracle needed 24% of the time, and even when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle.
These findings expose limitations of the state-of-the-art neural-based oracle generation technique.
arXiv Detail & Related papers (2023-07-29T16:34:56Z)
- Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph.
A flurry of methods have been introduced in recent years that attempt to make use of graph neural networks (GNNs) for this task.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z)
- On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)
- Artificial Text Detection via Examining the Topology of Attention Maps [58.46367297712477]
We propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA)
We empirically show that the features derived from the BERT model outperform count- and neural-based baselines up to 10% on three common datasets.
The probing analysis of the features reveals their sensitivity to the surface and syntactic properties.
arXiv Detail & Related papers (2021-09-10T12:13:45Z)
- Robustness Gym: Unifying the NLP Evaluation Landscape [91.80175115162218]
Deep neural networks are often brittle when deployed in real-world systems.
Recent research has focused on testing the robustness of such models.
We propose a solution in the form of Robustness Gym, a simple and extensible evaluation toolkit.
arXiv Detail & Related papers (2021-01-13T02:37:54Z)
- ScoreGAN: A Fraud Review Detector based on Multi Task Learning of Regulated GAN with Data Augmentation [50.779498955162644]
We propose ScoreGAN for fraud review detection that makes use of both review text and review rating scores in the generation and detection process.
Results show that the proposed framework outperformed the existing state-of-the-art framework, namely FakeGAN, in terms of AP by 7% and 5% on the Yelp and TripAdvisor datasets, respectively.
arXiv Detail & Related papers (2020-06-11T16:15:06Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.