FlaPy: Mining Flaky Python Tests at Scale
- URL: http://arxiv.org/abs/2305.04793v1
- Date: Mon, 8 May 2023 15:48:57 GMT
- Title: FlaPy: Mining Flaky Python Tests at Scale
- Authors: Martin Gruber, Gordon Fraser
- Abstract summary: FlaPy is a framework for researchers to mine flaky tests in a given or automatically sampled set of Python projects by rerunning their test suites.
FlaPy isolates the test executions using containerization and fresh execution environments to simulate real-world CI conditions.
FlaPy supports parallelizing the test executions using SLURM, making it feasible to scan thousands of projects for test flakiness.
- Score: 14.609208863749831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flaky tests obstruct software development, and studying and proposing
mitigations against them has therefore become an important focus of software
engineering research. To conduct sound investigations on test flakiness, it is
crucial to have large, diverse, and unbiased datasets of flaky tests. A common
method to build such datasets is by rerunning the test suites of selected
projects multiple times and checking for tests that produce different outcomes.
While using this technique on a single project is mostly straightforward,
applying it to a large and diverse set of projects raises several
implementation challenges such as (1) isolating the test executions, (2)
supporting multiple build mechanisms, (3) achieving feasible run times on large
datasets, and (4) analyzing and presenting the test outcomes. To address these
challenges we introduce FlaPy, a framework for researchers to mine flaky tests
in a given or automatically sampled set of Python projects by rerunning their
test suites. FlaPy isolates the test executions using containerization and
fresh execution environments to simulate real-world CI conditions and to
achieve accurate results. By supporting multiple dependency installation
strategies, it promotes diversity among the studied projects. FlaPy supports
parallelizing the test executions using SLURM, making it feasible to scan
thousands of projects for test flakiness. Finally, FlaPy analyzes the test
outcomes to determine which tests are flaky and depicts the results in a
concise table. A demo video of FlaPy is available at
https://youtu.be/ejy-be-FvDY
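The rerun-based detection method the abstract describes (run a test suite many times, flag tests whose outcomes differ) can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names, not FlaPy's actual implementation, which additionally handles containerization, dependency installation, and SLURM parallelization:

```python
def classify_flaky(runs: list[dict[str, str]]) -> set[str]:
    """Given one outcome map {test_id: "pass" | "fail"} per rerun,
    return the tests whose verdict differs across reruns."""
    outcomes: dict[str, set[str]] = {}
    for run in runs:
        for test_id, verdict in run.items():
            outcomes.setdefault(test_id, set()).add(verdict)
    # A test is flagged as flaky if it produced more than one distinct verdict.
    return {test_id for test_id, verdicts in outcomes.items() if len(verdicts) > 1}

# Three reruns of an unchanged suite; test_b flips once, so it is flaky.
runs = [
    {"test_a": "pass", "test_b": "pass"},
    {"test_a": "pass", "test_b": "fail"},
    {"test_a": "pass", "test_b": "pass"},
]
print(sorted(classify_flaky(runs)))  # ['test_b']
```

Because flakiness is nondeterministic, the number of reruns bounds what this approach can detect: a test that flips rarely may pass (or fail) in every recorded run and go unflagged, which is why frameworks like FlaPy make the rerun count configurable.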
Related papers
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs)
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- Collaborative non-parametric two-sample testing [55.98760097296213]
The goal is to identify nodes where the null hypothesis $p_v = q_v$ should be rejected.
We propose the non-parametric collaborative two-sample testing (CTST) framework that efficiently leverages the graph structure.
Our methodology integrates elements from f-divergence estimation, Kernel Methods, and Multitask Learning.
arXiv Detail & Related papers (2024-02-08T14:43:56Z)
- Taming Timeout Flakiness: An Empirical Study of SAP HANA [51.66447662096959]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
Test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
- The Effects of Computational Resources on Flaky Tests [9.694460778355925]
Flaky tests are tests that nondeterministically pass and fail in unchanged code.
Resource-Affected Flaky Tests indicate that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests.
arXiv Detail & Related papers (2023-10-18T17:42:58Z)
- Do Automatic Test Generation Tools Generate Flaky Tests? [12.813573907094074]
The prevalence and nature of flaky tests produced by test generation tools remain largely unknown.
We generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times.
Our results show that flakiness is at least as common in generated tests as in developer-written tests.
arXiv Detail & Related papers (2023-10-08T16:44:27Z)
- Exploring Demonstration Ensembling for In-context Learning [75.35436025709049]
In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task.
The standard approach for ICL is to prompt the LM with demonstrations followed by the test input.
In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation.
arXiv Detail & Related papers (2023-08-17T04:45:19Z)
- Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures.
We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
- On the use of test smells for prediction of flaky tests [0.0]
Flaky tests hamper the evaluation of test results and can increase costs.
Existing approaches based on the use of the test case vocabulary may be context-sensitive and prone to overfitting.
We investigate the use of test smells as predictors of flaky tests.
arXiv Detail & Related papers (2021-08-26T13:21:55Z)
- What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness.
We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
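The last entry builds on Dorfman's classic result: in the noiseless setting, pooling k samples at prevalence p costs one test per pool plus k individual retests whenever the pool is positive, giving an expected 1/k + 1 - (1-p)^k tests per person. A short sketch (function names are illustrative, not from the paper) can minimize this over the pool size:

```python
def expected_tests_per_person(p: float, k: int) -> float:
    # One pooled test per group of k, plus k individual retests
    # whenever the pool is positive, which happens with probability 1 - (1-p)^k.
    return 1.0 / k + 1.0 - (1.0 - p) ** k

def best_pool_size(p: float, max_k: int = 100) -> int:
    # Brute-force search over pool sizes; k = 1 is plain individual testing.
    return min(range(2, max_k + 1), key=lambda k: expected_tests_per_person(p, k))

k = best_pool_size(0.01)  # at 1% prevalence
print(k, round(expected_tests_per_person(0.01, k), 4))
```

At 1% prevalence the optimum is k = 11 with roughly 0.196 expected tests per person, i.e. about a fivefold saving over individual testing; the cited paper extends this idea to noisy test results via Bayesian sequential experimental design.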
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.