FlaPy: Mining Flaky Python Tests at Scale
- URL: http://arxiv.org/abs/2305.04793v1
- Date: Mon, 8 May 2023 15:48:57 GMT
- Title: FlaPy: Mining Flaky Python Tests at Scale
- Authors: Martin Gruber, Gordon Fraser
- Abstract summary: FlaPy is a framework for researchers to mine flaky tests in a given or automatically sampled set of Python projects by rerunning their test suites.
FlaPy isolates the test executions using containerization and fresh execution environments to simulate real-world CI conditions.
FlaPy supports parallelizing the test executions using SLURM, making it feasible to scan thousands of projects for test flakiness.
- Score: 14.609208863749831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flaky tests obstruct software development, and studying and proposing
mitigations against them has therefore become an important focus of software
engineering research. To conduct sound investigations on test flakiness, it is
crucial to have large, diverse, and unbiased datasets of flaky tests. A common
method to build such datasets is by rerunning the test suites of selected
projects multiple times and checking for tests that produce different outcomes.
While using this technique on a single project is mostly straightforward,
applying it to a large and diverse set of projects raises several
implementation challenges such as (1) isolating the test executions, (2)
supporting multiple build mechanisms, (3) achieving feasible run times on large
datasets, and (4) analyzing and presenting the test outcomes. To address these
challenges we introduce FlaPy, a framework for researchers to mine flaky tests
in a given or automatically sampled set of Python projects by rerunning their
test suites. FlaPy isolates the test executions using containerization and
fresh execution environments to simulate real-world CI conditions and to
achieve accurate results. By supporting multiple dependency installation
strategies, it promotes diversity among the studied projects. FlaPy supports
parallelizing the test executions using SLURM, making it feasible to scan
thousands of projects for test flakiness. Finally, FlaPy analyzes the test
outcomes to determine which tests are flaky and depicts the results in a
concise table. A demo video of FlaPy is available at
https://youtu.be/ejy-be-FvDY
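
As a rough illustration of the rerun-and-compare idea described in the abstract, the sketch below repeatedly invokes a project's pytest suite and flags tests whose verdicts differ across runs. This is a minimal approximation, not FlaPy's actual implementation: FlaPy additionally isolates each run in a fresh container, supports several dependency installation strategies, and parses structured result files rather than scraping console output.

```python
import subprocess
from collections import defaultdict

def collect_verdicts(n_runs: int = 10):
    """Rerun the project's pytest suite n_runs times and record,
    per test, the verdict observed in each run."""
    verdicts = defaultdict(list)
    for _ in range(n_runs):
        # Each iteration should ideally start from a fresh environment;
        # FlaPy achieves this with containers, which this sketch skips.
        proc = subprocess.run(
            ["python", "-m", "pytest", "-q", "--tb=no", "-rA"],
            capture_output=True, text=True,
        )
        for line in proc.stdout.splitlines():
            # With -rA, pytest prints summary lines like
            # "PASSED tests/test_io.py::test_read".
            parts = line.split()
            if len(parts) >= 2 and parts[0] in {"PASSED", "FAILED", "ERROR"}:
                verdicts[parts[1]].append(parts[0])
    return verdicts

def flaky_tests(verdicts) -> set:
    """A test is flagged as flaky if its verdict differs across reruns."""
    return {test for test, vs in verdicts.items() if len(set(vs)) > 1}

if __name__ == "__main__":
    print(sorted(flaky_tests(collect_verdicts(n_runs=10))))
```

In practice, parsing pytest's JUnit-XML output (--junitxml) is more robust than scraping the console summary.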
Related papers
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
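The entry above mentions a test built on a simple string kernel, but the abstract does not specify it, so the following is only a generic sketch of that style of two-sample test: an MMD statistic over a character n-gram overlap kernel, calibrated by a permutation test. All function names are illustrative.

```python
import random

def ngram_kernel(s: str, t: str, n: int = 4) -> float:
    """Jaccard overlap of character n-grams: one simple string kernel."""
    a = {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    b = {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}
    return len(a & b) / max(len(a | b), 1)

def mmd2(xs, ys, k):
    """Biased estimate of squared MMD between two samples of strings."""
    kxx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def permutation_pvalue(xs, ys, k=ngram_kernel, n_perm=200, seed=0):
    """p-value for H0: both samples come from the same distribution."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, k)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], k) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```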
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
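The abstract above only says the leakage detector works black-box from the contents of multiple-choice options. One generic signal in that spirit (not necessarily the paper's exact statistic) is slot stickiness under option shuffling: a model that memorized a benchmark tends to keep answering the original slot even after the option contents move. The `ask_model` interface below is hypothetical.

```python
import random

def slot_stickiness(question, options, ask_model, n_shuffles=20, seed=0):
    """Leak heuristic: how often does the model answer the *same slot*
    after the option contents have been shuffled away from it?
    `ask_model(question, options) -> chosen option index` is a
    hypothetical black-box interface (no weights or training data)."""
    rng = random.Random(seed)
    baseline = ask_model(question, options)
    tried = same_slot = 0
    for _ in range(n_shuffles):
        perm = list(range(len(options)))
        rng.shuffle(perm)
        if perm[baseline] == baseline:
            continue  # original answer stayed put; uninformative shuffle
        tried += 1
        choice = ask_model(question, [options[i] for i in perm])
        if choice == baseline:  # followed the slot, not the content
            same_slot += 1
    return same_slot / max(tried, 1)  # near 1.0 suggests memorization
```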
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
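Scoring the targeted-coverage tasks described above reduces to checking whether a generated test executes a given line. Below is a minimal sketch with coverage.py, assuming the generated test is importable as a zero-argument callable; the names are illustrative and TESTEVAL's own harness may differ.

```python
import coverage

def covers_target_line(run_test, source_file: str, target_line: int) -> bool:
    """Check whether executing run_test (a callable wrapping an
    LLM-generated test) hits target_line of source_file."""
    cov = coverage.Coverage(include=[source_file])
    cov.start()
    try:
        run_test()
    finally:
        cov.stop()
    # lines() returns the executed line numbers recorded for the file;
    # the path must match how coverage saw it, typically absolute.
    executed = cov.get_data().lines(source_file) or []
    return target_line in executed
```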
- Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
The test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
- The Effects of Computational Resources on Flaky Tests [9.694460778355925]
Flaky tests are tests that nondeterministically pass and fail in unchanged code.
The existence of resource-affected flaky tests indicates that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests.
arXiv Detail & Related papers (2023-10-18T17:42:58Z)
- Do Automatic Test Generation Tools Generate Flaky Tests? [12.813573907094074]
The prevalence and nature of flaky tests produced by test generation tools remain largely unknown.
We generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times.
Our results show that flakiness is at least as common in generated tests as in developer-written tests.
arXiv Detail & Related papers (2023-10-08T16:44:27Z)
- Exploring Demonstration Ensembling for In-context Learning [75.35436025709049]
In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task.
The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input.
In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation.
arXiv Detail & Related papers (2023-08-17T04:45:19Z)
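The DENSE abstract contrasts ensembling with simple concatenation but gives no details here; a plausible minimal reading is: partition the demonstrations into buckets, prompt once per bucket, and combine the answers. `lm_predict` below is a hypothetical black-box interface, and majority vote stands in for whatever combination rule the paper actually uses.

```python
from collections import Counter

def dense_predict(test_input: str, demos: list, lm_predict, k: int = 3):
    """Demonstration ensembling, sketched: k bucketed prompts instead of
    one concatenated prompt, aggregated by majority vote.
    lm_predict(prompt) -> label is a hypothetical interface."""
    buckets = [demos[i::k] for i in range(k)]
    votes = Counter()
    for bucket in buckets:
        prompt = "\n\n".join(bucket + [test_input])
        votes[lm_predict(prompt)] += 1
    return votes.most_common(1)[0][0]
```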
- Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures.
We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
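Sequential tests of this kind are typically built on the testing-by-betting template, which is what makes them valid at any stopping time; the abstract does not give the construction, so the following is a sketch of that general mechanism (the specific payoffs in the paper come from kernelized dependence measures).

```latex
% Anytime-valid sequential testing via a wealth process (sketch).
% g_t: payoff with E[g_t \mid \text{past}] \le 0 under H_0 (independence);
% \lambda_t: predictable bet chosen from the data seen so far.
W_0 = 1, \qquad W_t = W_{t-1}\bigl(1 + \lambda_t g_t\bigr),
\qquad \lambda_t g_t \ge -1 .
% Under H_0, (W_t) is a nonnegative supermartingale, so Ville's
% inequality bounds the crossing probability:
\Pr_{H_0}\!\left[\exists\, t : W_t \ge 1/\alpha\right] \le \alpha ,
% hence rejecting the first time W_t \ge 1/\alpha gives a level-\alpha
% test that remains valid under optional stopping.
```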
- What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness.
We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
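As a toy illustration of the identifier-vocabulary approach studied above: tokenize test code into identifiers and train a classifier on the resulting bag of words. The four-example dataset below is fabricated purely to show the pipeline shape; the replication itself uses real flaky-test datasets across projects, and a Random Forest is one of the classifiers commonly used in this line of work.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Fabricated examples, purely to demonstrate the pipeline shape.
tests = [
    "def test_fetch_url_timeout(): assert fetch(url, timeout=1)",
    "def test_add(): assert add(2, 2) == 4",
    "def test_async_job_retries(): assert job.wait(retries=3)",
    "def test_parse_config(): assert parse('a=1')['a'] == '1'",
]
labels = [1, 0, 1, 0]  # 1 = flaky, 0 = stable

model = make_pipeline(
    CountVectorizer(token_pattern=r"[A-Za-z_]\w*"),  # identifier tokens
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(tests, labels)
print(model.predict(["def test_network_poll_timeout(): ..."]))
```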
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.