Do Automatic Test Generation Tools Generate Flaky Tests?
- URL: http://arxiv.org/abs/2310.05223v1
- Date: Sun, 8 Oct 2023 16:44:27 GMT
- Title: Do Automatic Test Generation Tools Generate Flaky Tests?
- Authors: Martin Gruber, Muhammad Firhard Roslan, Owain Parry, Fabian Scharnböck, Phil McMinn, Gordon Fraser
- Abstract summary: The prevalence and nature of flaky tests produced by test generation tools remain largely unknown.
We generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times.
Our results show that flakiness is at least as common in generated tests as in developer-written tests.
- Score: 12.813573907094074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-deterministic test behavior, or flakiness, is common and dreaded among
developers. Researchers have studied the issue and proposed approaches to
mitigate it. However, the vast majority of previous work has only considered
developer-written tests. The prevalence and nature of flaky tests produced by
test generation tools remain largely unknown. We ask whether such tools also
produce flaky tests and how these differ from developer-written ones.
Furthermore, we evaluate mechanisms that suppress flaky test generation. We
sample 6,356 projects written in Java or Python. For each project, we generate
tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200
times, looking for inconsistent outcomes. Our results show that flakiness is at
least as common in generated tests as in developer-written tests. Nevertheless,
existing flakiness suppression mechanisms implemented in EvoSuite are effective
in alleviating this issue (71.7% fewer flaky tests). Compared to
developer-written flaky tests, the causes of generated flaky tests are
distributed differently: their non-deterministic behavior is more frequently
caused by randomness than by networking or concurrency. With flakiness
suppression enabled, the remaining flaky tests differ significantly from any
flakiness previously reported; most are attributable to runtime optimizations
and EvoSuite-internal resource thresholds. These insights, together with the
accompanying dataset, can help maintainers improve test generation tools,
inform recommendations for developers using these tools, and serve as a
foundation for future research on test flakiness and test generation.
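
As a concrete illustration of the rerun-based detection described above, the sketch below executes a single pytest test repeatedly and flags it as flaky when the verdicts disagree. This is a minimal approximation of the study's setup, not the authors' actual harness; the test id is a placeholder, and the real pipeline also manages execution environments and records failure messages.

```python
import subprocess
import sys

def run_once(test_id: str) -> bool:
    """Run a single pytest test in a fresh process; True means it passed."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_id],
        capture_output=True,
    )
    return result.returncode == 0

def is_flaky(test_id: str, reruns: int = 200) -> bool:
    """Flag a test as flaky if repeated reruns yield inconsistent verdicts."""
    verdicts = {run_once(test_id) for _ in range(reruns)}
    return len(verdicts) > 1

if __name__ == "__main__":
    # Placeholder test id; substitute a generated EvoSuite/Pynguin test.
    print(is_flaky("tests/test_example.py::test_parse"))
```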
Related papers
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.
It covers initial test authoring, test suite completion, and code coverage improvement.
We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects captured during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
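
TestGen itself is internal to Meta and not described at code level in this summary; the toy sketch below only illustrates the carving idea under that assumption: serialize an object's state observed during execution, then emit a regression test asserting the same state is reproduced. The `Counter` class and emitted test name are hypothetical.

```python
import json

def carve(factory, method: str, *args) -> str:
    """Execute a call and serialize the resulting object state (the observation)."""
    obj = factory()
    getattr(obj, method)(*args)
    return json.dumps(vars(obj), sort_keys=True)

def emit_test(cls_name: str, method: str, args: tuple, expected: str) -> str:
    """Render a unit test that replays the call and checks the carved state."""
    return (
        f"def test_{method}_matches_observation():\n"
        f"    import json\n"
        f"    obj = {cls_name}()\n"
        f"    obj.{method}(*{args!r})\n"
        f"    assert json.dumps(vars(obj), sort_keys=True) == {expected!r}\n"
    )

class Counter:  # hypothetical class under test
    def __init__(self):
        self.value = 0
    def add(self, n: int):
        self.value += n

print(emit_test("Counter", "add", (3,), carve(Counter, "add", 3)))
```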
- Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
The test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
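
SAP's tooling is not public in this summary; as a rough way to probe the timeout factor the paper studies, the sketch below reruns a test under a tight and a generous wall-clock budget and flags failures that disappear when given more time. The budget values and test id are arbitrary assumptions.

```python
import subprocess
import sys

def passes_within(test_id: str, seconds: float) -> bool:
    """Run one pytest test under a wall-clock budget; a timeout counts as failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", test_id],
            capture_output=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def timeout_sensitive(test_id: str, tight: float = 5.0, roomy: float = 120.0) -> bool:
    """A failure that vanishes under a larger budget points at the timeout, not a bug."""
    return not passes_within(test_id, tight) and passes_within(test_id, roomy)
```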
- 230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers [9.45325012281881]
Flaky tests are tests that can non-deterministically pass or fail, even in the absence of code changes.
How can one quickly determine whether a test failed due to flakiness or because it detected a bug?
arXiv Detail & Related papers (2024-01-28T22:36:30Z)
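
A natural baseline for the question above, though not one of the classifiers the paper evaluates, is simple rerunning: a failed test that passes on any rerun was likely flaky. The sketch assumes pytest and a known test id; a real classifier would also use failure messages, history, and coverage signals.

```python
import subprocess
import sys

def classify_failure(test_id: str, reruns: int = 10) -> str:
    """Rerun an observed failure; any passing rerun suggests flakiness, not a bug."""
    for _ in range(reruns):
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", test_id],
            capture_output=True,
        )
        if result.returncode == 0:
            return "flaky"
    return "likely a real fault"
```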
- TestSpark: IntelliJ IDEA's Ultimate Test Generation Companion [15.13443954421825]
This paper introduces TestSpark, a plugin for IntelliJ IDEA that enables users to generate unit tests with only a few clicks.
TestSpark also allows users to easily modify and run each generated test and integrate them into the project workflow.
arXiv Detail & Related papers (2024-01-12T13:53:57Z)
- The Effects of Computational Resources on Flaky Tests [9.694460778355925]
Flaky tests are tests that nondeterministically pass and fail in unchanged code.
Their analysis of resource-affected flaky tests indicates that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests.
arXiv Detail & Related papers (2023-10-18T17:42:58Z)
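
A POSIX-only sketch of the idea behind resource-affected flaky tests, with arbitrary limits rather than the paper's controlled setup: run the same test under a tight and a generous memory ceiling, and flag it when the verdict flips.

```python
import resource
import subprocess
import sys

def run_with_memory_cap(test_id: str, max_bytes: int) -> bool:
    """Run one pytest test in a child whose address space is capped (POSIX only)."""
    def cap():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_id],
        capture_output=True,
        preexec_fn=cap,  # applied in the child just before exec
    )
    return result.returncode == 0

def resource_affected(test_id: str) -> bool:
    """Flag tests whose verdict differs between 256 MiB and 4 GiB of memory."""
    return run_with_memory_cap(test_id, 256 * 2**20) != run_with_memory_cap(test_id, 4 * 2**30)
```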
- Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
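
The paper's state monitoring is richer than this, but the core comparison can be sketched as follows, assuming a JSON-serializable SUT object and a hypothetical snapshot directory: record the post-test object state on the previous version, then diff against it on the new one.

```python
import json
from pathlib import Path

SNAPSHOTS = Path("oracle_snapshots")  # hypothetical storage location

def snapshot(obj) -> str:
    """Serialize an object's observable fields deterministically."""
    return json.dumps(vars(obj), sort_keys=True, default=str)

def state_unchanged(test_name: str, obj) -> bool:
    """Compare post-test state with the snapshot recorded on the previous version.

    The first run records the snapshot; on later runs, any difference is a
    candidate behavioral change for the amplified oracle to surface.
    """
    SNAPSHOTS.mkdir(exist_ok=True)
    path = SNAPSHOTS / f"{test_name}.json"
    current = snapshot(obj)
    if not path.exists():
        path.write_text(current)
        return True
    return path.read_text() == current
```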
- FlaPy: Mining Flaky Python Tests at Scale [14.609208863749831]
FlaPy is a framework for researchers to mine flaky tests in a given or automatically sampled set of Python projects by rerunning their test suites.
FlaPy isolates the test executions using containerization and fresh execution environments to simulate real-world CI conditions.
FlaPy supports parallelizing the test executions using SLURM, making it feasible to scan thousands of projects for test flakiness.
arXiv Detail & Related papers (2023-05-08T15:48:57Z)
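
FlaPy's actual interface is documented with the tool itself; the sketch below only shows the isolation idea using the docker CLI: each rerun gets a throwaway container, so leftover state cannot leak between executions. The image name is a placeholder for a project image with its test dependencies installed.

```python
import subprocess

def run_suite_isolated(image: str) -> bool:
    """Run the project's test suite once in a fresh, throwaway container."""
    result = subprocess.run(
        ["docker", "run", "--rm", image, "python", "-m", "pytest", "-q"],
        capture_output=True,
    )
    return result.returncode == 0

def suite_is_flaky(image: str, reruns: int = 10) -> bool:
    """Disagreement between isolated reruns indicates flakiness."""
    verdicts = {run_suite_isolated(image) for _ in range(reruns)}
    return len(verdicts) > 1
```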
- BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome because test sentences are generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
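
BiasTestGPT's own prompts and pipeline live in its release; the sketch below only illustrates controllable generation from a group/attribute pair using the openai Python client, with the prompt wording and model name as assumptions rather than the framework's actual choices.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_test_sentences(group: str, attribute: str, n: int = 5) -> list[str]:
    """Ask a chat model for sentences pairing a social group with an attribute."""
    prompt = (  # illustrative wording, not BiasTestGPT's actual prompt
        f"Write {n} short, natural sentences that associate the social group "
        f"'{group}' with the attribute '{attribute}'. One sentence per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]
```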