Perfect is the enemy of test oracle
- URL: http://arxiv.org/abs/2302.01488v2
- Date: Wed, 5 Apr 2023 23:32:13 GMT
- Title: Perfect is the enemy of test oracle
- Authors: Ali Reza Ibrahimzada, Yigit Varli, Dilara Tekinoglu, Reyhaneh Jabbarvand
- Abstract summary: Test oracles rely on a ground-truth that can distinguish between the correct and buggy behavior to determine whether a test fails (detects a bug) or passes.
This paper presents SEER, a learning-based approach that, in the absence of test assertions, can determine whether a unit test passes or fails on a given method under test (MUT).
Our experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is effective in predicting the fail or pass labels.
- Score: 1.457696018869121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automation of test oracles is one of the most challenging facets of software
testing, yet it remains less addressed than automated test input generation. Test
oracles rely on a ground-truth that can distinguish between correct and buggy
behavior to determine whether a test fails (detects a bug) or passes. What makes
the oracle problem challenging and undecidable is the assumption that the
ground-truth should know the exact expected, correct, or buggy behavior. However,
we argue that one can still build an accurate oracle without knowing the exact
correct or buggy behavior, but only how the two might differ. This paper presents
SEER, a learning-based approach that, in the absence of test assertions or other
types of oracle, can determine whether a unit test passes or fails on a given
method under test (MUT). To build the ground-truth, SEER jointly embeds unit
tests and the implementations of MUTs into a unified vector space, in such a way
that the neural representation of a test is similar to that of the MUTs it passes
on, but dissimilar to that of the MUTs it fails on. The classifier built on top
of this vector representation serves as the oracle, generating a "fail" label
when the test inputs detect a bug in the MUT and a "pass" label otherwise. Our
extensive experiments on applying SEER to more than 5K unit tests from a diverse
set of open-source Java projects show that the produced oracle is (1) effective
in predicting fail or pass labels, achieving an overall accuracy, precision,
recall, and F1 measure of 93%, 86%, 94%, and 90%, (2) generalizable, predicting
the labels for unit tests of projects that were not in the training or validation
set with a negligible performance drop, and (3) efficient, detecting the
existence of bugs in only 6.5 milliseconds on average.
Related papers
- Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs [21.06722050714324]
This paper focuses on automation of test oracles for clients of widely used Java libraries, e.g., java.lang and java.util packages.
We use large language models as an enabling technology to embody our insight into a framework for test oracle automation.
arXiv Detail & Related papers (2024-11-04T04:24:25Z)
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper addresses the problem of performing both sample recognition and outlier rejection during inference when outliers are present.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z)
- LLM-Powered Test Case Generation for Detecting Tricky Bugs [30.82169191775785]
AID generates test inputs and oracles targeting plausibly correct programs.
We evaluate AID on two large-scale datasets with tricky bugs: TrickyBugs and EvalPlus.
The evaluation results show that the recall, precision, and F1 score of AID outperform the state-of-the-art by up to 1.80x, 2.65x, and 1.66x, respectively.
arXiv Detail & Related papers (2024-04-16T06:20:06Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We investigate the capabilities of smaller self-adaptive LMs with only unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that can either be qualified as simple (e.g., unit tests) or require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
- Mind the Gap: The Difference Between Coverage and Mutation Score Can Guide Testing Efforts [8.128730027609471]
An "adequate" test suite should effectively find all inconsistencies between a system's requirements/specifications and its implementation.
Practitioners frequently use code coverage to approximate adequacy, while academics argue that mutation score may better approximate true (oracular) adequacy.
We propose a new framework for reasoning about the extent, limits, and nature of a given testing effort based on an idea we call the oracle gap (a toy sketch of the two underlying metrics appears after this list).
arXiv Detail & Related papers (2023-09-05T17:05:52Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
- Auditing AI models for Verified Deployment under Semantic Specifications [65.12401653917838]
AuditAI bridges the gap between interpretable formal verification and scalability.
We show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations.
arXiv Detail & Related papers (2021-09-25T22:53:24Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
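Regarding the "Mind the Gap" entry above: the sketch below is only a toy illustration of the two metrics that entry contrasts (statement coverage and mutation score) and of their difference. The paper's own "oracle gap" framework is more involved; the helper names and numbers here are made up for illustration.

```python
# Toy comparison of statement coverage and mutation score for one test suite;
# a large positive difference suggests tests execute code without checking it.

def statement_coverage(executed_lines: set, all_lines: set) -> float:
    """Fraction of program statements executed by the test suite."""
    return len(executed_lines & all_lines) / len(all_lines)

def mutation_score(killed_mutants: int, total_mutants: int) -> float:
    """Fraction of injected mutants detected (killed) by the test suite."""
    return killed_mutants / total_mutants

# Hypothetical measurements for a small program with 10 statements.
coverage = statement_coverage(executed_lines={1, 2, 3, 4, 5, 6, 7, 8},
                              all_lines=set(range(1, 11)))
score = mutation_score(killed_mutants=5, total_mutants=10)

print(f"coverage={coverage:.0%}, mutation score={score:.0%}, "
      f"gap={coverage - score:.0%}")   # coverage=80%, mutation score=50%, gap=30%
```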
This list is automatically generated from the titles and abstracts of the papers on this site.