Neural-Based Test Oracle Generation: A Large-scale Evaluation and
Lessons Learned
- URL: http://arxiv.org/abs/2307.16023v2
- Date: Fri, 25 Aug 2023 22:26:59 GMT
- Title: Neural-Based Test Oracle Generation: A Large-scale Evaluation and
Lessons Learned
- Authors: Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian
Elbaum, Willem Visser
- Abstract summary: TOGA is a recently developed neural-based method for automatic test oracle generation.
It misclassifies the type of oracle needed 24% of the time, and even when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle.
These findings expose limitations of the state-of-the-art neural-based oracle generation technique.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Defining test oracles is crucial and central to test development, but manual
construction of oracles is expensive. While recent neural-based automated test
oracle generation techniques have shown promise, their real-world effectiveness
remains a compelling question requiring further exploration and understanding.
This paper investigates the effectiveness of TOGA, a recently developed
neural-based method for automatic test oracle generation by Dinella et al. TOGA
utilizes EvoSuite-generated test inputs and generates both exception and
assertion oracles. In a Defects4j study, TOGA outperformed specification,
search, and neural-based techniques, detecting 57 bugs, including 30 unique
bugs not detected by other methods. To gain a deeper understanding of its
applicability in real-world settings, we conducted a series of external,
extended, and conceptual replication studies of TOGA.
In a large-scale study involving 25 real-world Java systems, 223.5K test
cases, and 51K injected faults, we evaluate TOGA's ability to improve
fault-detection effectiveness relative to the state-of-the-practice and the
state-of-the-art. We find that TOGA misclassifies the type of oracle needed 24%
of the time, and that even when it classifies correctly, around 62% of the time
it is not confident enough to generate any assertion oracle. When it does generate an
assertion oracle, more than 47% of them are false positives, and the true
positive assertions only increase fault detection by 0.3% relative to prior
work. These findings expose limitations of the state-of-the-art neural-based
oracle generation technique, provide valuable insights for improvement, and
offer lessons for evaluating future automated oracle generation methods.
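To make the two oracle kinds the abstract refers to concrete, here is a minimal, hypothetical sketch (not actual TOGA or EvoSuite output; the class and values are illustrative) of an exception oracle and an assertion oracle, written as plain-Java checks so the example is self-contained:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;

// Illustrative only: the two oracle kinds discussed in the paper,
// applied to a standard-library stack rather than a system under test.
public class OracleKinds {
    public static void main(String[] args) {
        Deque<Integer> stack = new ArrayDeque<>();

        // Exception oracle: assert that an invalid call raises the
        // expected exception (popping an empty deque must fail).
        boolean threw = false;
        try {
            stack.pop();
        } catch (NoSuchElementException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected NoSuchElementException");

        // Assertion oracle: assert a property of a returned value.
        stack.push(42);
        int top = stack.pop();
        if (top != 42) throw new AssertionError("expected 42, got " + top);

        System.out.println("both oracles passed");
    }
}
```

TOGA's classifier first decides which of these two oracle kinds a given test prefix needs; the 24% misclassification figure above refers to that decision, and the false-positive figure refers to generated assertion oracles that fail on correct code.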
Related papers
- Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z)
- TOGLL: Correct and Strong Test Oracle Generation with LLMs [0.8057006406834466]
Test oracles play a crucial role in software testing, enabling effective bug detection.
Despite initial promise, neural-based methods for automated test oracle generation often result in a large number of false positives.
We present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles.
arXiv Detail & Related papers (2024-05-06T18:37:35Z)
- GenAI Detection Tools, Adversarial Techniques and Implications for Inclusivity in Higher Education [0.0]
This study investigates the efficacy of six major Generative AI (GenAI) text detectors when confronted with machine-generated content that has been modified.
The results demonstrate that the detectors' already low accuracy rates (39.5%) show major reductions in accuracy (17.4%) when faced with manipulated content.
The accuracy limitations and the potential for false accusations demonstrate that these tools cannot currently be recommended for determining whether violations of academic integrity have occurred.
arXiv Detail & Related papers (2024-03-28T04:57:13Z)
- Insight Into SEER [0.0]
The SEER tool was developed to predict test outcomes without needing assertion statements.
The tool has an overall accuracy of 93%, precision of 86%, recall of 94%, and an F1 score of 90%.
arXiv Detail & Related papers (2023-11-02T11:54:58Z)
- Towards a Complete Metamorphic Testing Pipeline [56.75969180129005]
Metamorphic Testing (MT) addresses the test oracle problem by examining the relationships between input-output pairs in consecutive executions of the System Under Test (SUT).
These relations, known as Metamorphic Relations (MRs), specify the expected output changes resulting from specific input changes.
Our research aims to develop methods and tools that assist testers in generating MRs, defining constraints, and providing explainability for MR outcomes.
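As a concrete illustration of how an MR sidesteps the need for an expected output, here is a minimal, hypothetical Java sketch (not from the paper) checking a common relation for a sorting routine: permuting the input must not change the sorted output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative metamorphic relation check: no expected-output
// specification is needed, only a relation between two executions.
public class MRDemo {
    public static void main(String[] args) {
        List<Integer> original = new ArrayList<>(List.of(3, 1, 2));
        List<Integer> permuted = new ArrayList<>(original);
        Collections.shuffle(permuted, new Random(7)); // follow-up input

        Collections.sort(original);
        Collections.sort(permuted);

        // MR: sort(permute(x)) == sort(x)
        if (!original.equals(permuted))
            throw new AssertionError("metamorphic relation violated");
        System.out.println("MR holds");
    }
}
```

Here `Collections.sort` stands in for the SUT; in practice the tester must identify which relations the SUT is expected to satisfy, which is exactly the MR-generation step the paper aims to assist.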
arXiv Detail & Related papers (2023-09-30T10:49:22Z)
- A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often tend to 'hallucinate' which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
arXiv Detail & Related papers (2023-07-08T14:25:57Z)
- Towards More Realistic Evaluation for Neural Test Oracle Generation [11.005450298374285]
Unit tests can help guard and improve software quality but require a substantial amount of time and effort to write and maintain.
Recent studies proposed leveraging neural models to generate test oracles, i.e., neural test oracle generation (NTOG).
The evaluation settings used in these studies, however, could mislead the understanding of existing NTOG approaches' performance.
arXiv Detail & Related papers (2023-05-26T15:56:57Z)
- Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction [14.444294152595429]
The number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
arXiv Detail & Related papers (2022-09-23T10:50:47Z)
- Exploring linguistic feature and model combination for speech recognition based automatic AD detection [61.91708957996086]
Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques.
Scarcity of specialist data leads to uncertainty in both model selection and feature learning when developing such systems.
This paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and Roberta pre-trained text encoders.
arXiv Detail & Related papers (2022-06-28T05:09:01Z)
- Anomaly Detection Based on Selection and Weighting in Latent Space [73.01328671569759]
We propose a novel selection-and-weighting-based anomaly detection framework called SWAD.
Experiments on both benchmark and real-world datasets have shown the effectiveness and superiority of SWAD.
arXiv Detail & Related papers (2021-03-08T10:56:38Z)
- SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples.
We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.