Related papers: Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures

URL: http://arxiv.org/abs/2504.16777v1
Date: Wed, 23 Apr 2025 14:51:23 GMT
Title: Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures
Authors: Owain Parry, Gregory Kapfhammer, Michael Hilton, Phil McMinn
Abstract summary: Flaky tests produce inconsistent outcomes without code changes. Developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We show that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness.
Score: 6.824747267214373
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We discovered that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This suggests that developers can reduce repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. It contains 810 flaky tests, which we levered to perform a mixed-method empirical analysis of co-occurring flaky test failures. Systemic flakiness is significant and widespread. We performed agglomerative clustering of flaky tests based on their failure co-occurrence, finding that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, we demonstrated a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness.
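
The clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state which distance measure, linkage, or cut-off was used, so the Jaccard distance over failing-run sets, the average linkage, and the 0.5 threshold below are all assumptions, and the test names and run IDs are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Distance between two tests based on the runs in which each failed."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_flaky_tests(
    failures: Dict[str, Set[int]], threshold: float = 0.5
) -> List[Set[str]]:
    """Average-linkage agglomerative clustering (assumed linkage and threshold).

    Starts with one cluster per test and repeatedly merges the closest pair
    until the closest pair is farther apart than `threshold`.
    """
    clusters: List[Set[str]] = [{name} for name in sorted(failures)]

    def linkage(c1: Set[str], c2: Set[str]) -> float:
        dists = [jaccard_distance(failures[x], failures[y]) for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical data: test name -> IDs of the test suite runs in which it failed.
failures = {
    "testFetchRemote": {3, 17, 42},
    "testUploadRemote": {3, 17, 42, 58},
    "testRetryRemote": {3, 42},
    "testClockSkew": {9},
}

clusters = cluster_flaky_tests(failures)
for cluster in clusters:
    print(sorted(cluster))
```

In this toy data the three "Remote" tests fail in largely the same runs and end up in one cluster, while testClockSkew never fails alongside them and stays on its own. A cluster of this kind is what the abstract calls systemic flakiness: a candidate for one shared root cause rather than three separate repairs.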

Related papers

Automating Quantum Software Maintenance: Flakiness Detection and Root Cause Analysis [4.554856650068748]
Flaky tests, which pass or fail inconsistently without code changes, are a major challenge in software engineering. We aim to create an automated framework to detect flaky tests in quantum software.
arXiv Detail & Related papers (2024-10-31T02:43:04Z)
Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code. We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests [3.0846824529023382]
Flaky tests can pass or fail non-deterministically, without alterations to a software system. State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy.
arXiv Detail & Related papers (2024-03-01T22:00:44Z)
230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers [9.45325012281881]
Flaky tests are tests that can non-deterministically pass or fail, even in the absence of code changes. How can one quickly determine whether a test failed due to flakiness or because it detected a bug?
arXiv Detail & Related papers (2024-01-28T22:36:30Z)
Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
Statistical and Computational Phase Transitions in Group Testing [73.55361918807883]
We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease. We consider two different simple random procedures for assigning individuals tests.
arXiv Detail & Related papers (2022-06-15T16:38:50Z)
What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness. We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle. In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize. Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs). To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics. To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually. Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.