230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure
Classifiers
- URL: http://arxiv.org/abs/2401.15788v1
- Date: Sun, 28 Jan 2024 22:36:30 GMT
- Title: 230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure
Classifiers
- Authors: Abdulrahman Alshammari, Paul Ammann, Michael Hilton, Jonathan Bell
- Abstract summary: Flaky tests are tests that can non-deterministically pass or fail, even in the absence of code changes.
How to quickly determine if a test failed due to flakiness, or if it detected a bug?
- Score: 9.45325012281881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flaky tests are tests that can non-deterministically pass or fail, even in
the absence of code changes. Despite being a source of false alarms, flaky tests
often remain in test suites once they are detected, as they also may be relied
upon to detect true failures. Hence, a key open problem in flaky test research
is: How to quickly determine if a test failed due to flakiness, or if it
detected a bug? The state-of-the-practice is for developers to re-run failing
tests: if a test fails and then passes, it is flaky by definition; if the test
persistently fails, it is likely a true failure. However, this approach can be
both ineffective and inefficient. An alternate approach that developers may
already use for triaging test failures is failure de-duplication, which matches
newly discovered test failures to previously witnessed flaky and true failures.
However, because flaky test failure symptoms might resemble those of true
failures, there is a risk of misclassifying a true test failure as a flaky
failure to be ignored. Using a dataset of 498 flaky tests from 22 open-source
Java projects, we collect a large dataset of 230,439 failure messages (both
flaky and not), allowing us to empirically investigate the efficacy of failure
de-duplication. We find that for some projects, this approach is extremely
effective (with 100% specificity), while for other projects, the approach is
entirely ineffective. By analyzing the characteristics of these flaky and
non-flaky failures, we provide useful guidance on how developers should rely on
this approach.
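To make the failure de-duplication idea concrete, the following is a minimal sketch (not the authors' tooling): a newly observed failure is reduced to a coarse signature and compared against signatures of failures that re-running has already shown to be flaky. The signature scheme (exception type plus the top stack frames with line numbers dropped) and all class and method names are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Minimal sketch of failure de-duplication: a new test failure is compared
 * against signatures of previously witnessed flaky failures. The signature
 * scheme (exception type + top stack frames, line numbers stripped) is an
 * illustrative assumption, not the paper's exact matching strategy.
 */
public class FlakyFailureDeduplicator {

    private final Set<String> knownFlakySignatures = new HashSet<>();

    /** Reduce a failure to a coarse signature so superficial variation is ignored. */
    static String signature(Throwable failure, int frames) {
        String topFrames = java.util.Arrays.stream(failure.getStackTrace())
                .limit(frames)
                .map(f -> f.getClassName() + "." + f.getMethodName()) // drop line numbers
                .collect(Collectors.joining(";"));
        return failure.getClass().getName() + "|" + topFrames;
    }

    /** Record a failure that re-running has already shown to be flaky. */
    public void recordKnownFlaky(Throwable failure) {
        knownFlakySignatures.add(signature(failure, 5));
    }

    /** Classify a fresh failure: true = "looks like a known flaky failure". */
    public boolean looksFlaky(Throwable newFailure) {
        return knownFlakySignatures.contains(signature(newFailure, 5));
    }
}
```

Given the paper's finding that specificity is 100% for some projects but the approach is entirely ineffective for others, a matched signature is best treated as lowering suspicion rather than as grounds to discard the failure outright.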
Related papers
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
Test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
- The Effects of Computational Resources on Flaky Tests [9.694460778355925]
Flaky tests are tests that nondeterministically pass and fail in unchanged code.
Resource-Affected Flaky Tests indicate that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests.
arXiv Detail & Related papers (2023-10-18T17:42:58Z)
- Just-in-Time Flaky Test Detection via Abstracted Failure Symptom Matching [11.677067576981075]
We use failure symptoms to identify flaky test failures in a Continuous Integration pipeline for a large industrial software system, SAP.
Our method shows the potential of using failure symptoms to identify recurring flaky failures, achieving a precision of at least 96% (an illustrative sketch of symptom abstraction follows this list).
arXiv Detail & Related papers (2023-10-10T04:15:45Z)
- Do Automatic Test Generation Tools Generate Flaky Tests? [12.813573907094074]
The prevalence and nature of flaky tests produced by test generation tools remain largely unknown.
We generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times.
Our results show that flakiness is at least as common in generated tests as in developer-written tests.
arXiv Detail & Related papers (2023-10-08T16:44:27Z)
- Perfect is the enemy of test oracle [1.457696018869121]
Test oracles rely on a ground-truth that can distinguish between the correct and buggy behavior to determine whether a test fails (detects a bug) or passes.
This paper presents SEER, a learning-based approach that, in the absence of test assertions, can determine whether a unit test passes or fails on a given method under test (MUT).
Our experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is effective in predicting the fail or pass labels.
arXiv Detail & Related papers (2023-02-03T01:49:33Z)
- On the use of test smells for prediction of flaky tests [0.0]
Flaky tests hamper the evaluation of test results and can increase costs.
Existing approaches based on the use of the test case vocabulary may be context-sensitive and prone to overfitting.
We investigate the use of test smells as predictors of flaky tests.
arXiv Detail & Related papers (2021-08-26T13:21:55Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
- Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm.
Results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z)
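As referenced in the Just-in-Time Flaky Test Detection entry above, matching recurring flaky failures typically requires abstracting away volatile details before comparing symptoms. The sketch below illustrates one plausible form of such abstraction; the masking rules (numbers, hex addresses, temp paths) are assumptions for illustration, not the rules used in that work.

```java
import java.util.regex.Pattern;

/**
 * Sketch of "abstracted failure symptom" matching. The abstraction rules
 * below (masking numbers, hex addresses, and temp paths) are illustrative
 * assumptions, not the exact rules used in the cited work.
 */
public final class SymptomAbstraction {

    private static final Pattern HEX = Pattern.compile("0x[0-9a-fA-F]+");
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+\\b");
    private static final Pattern TMP_PATH = Pattern.compile("/tmp/\\S+");

    /** Replace volatile tokens so recurring failures collapse to one symptom string. */
    public static String abstractSymptom(String rawFailureMessage) {
        String s = HEX.matcher(rawFailureMessage).replaceAll("<ADDR>");
        s = TMP_PATH.matcher(s).replaceAll("<PATH>");
        s = NUMBER.matcher(s).replaceAll("<NUM>");
        return s;
    }

    public static void main(String[] args) {
        // Two occurrences of the same flaky timeout differ only in volatile details...
        String a = "Timeout after 3012 ms waiting on /tmp/build1234/socket";
        String b = "Timeout after 2987 ms waiting on /tmp/build9876/socket";
        // ...but they abstract to the same symptom, so the second can be matched to the first.
        System.out.println(abstractSymptom(a).equals(abstractSymptom(b))); // true
    }
}
```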