Just-in-Time Flaky Test Detection via Abstracted Failure Symptom Matching
- URL: http://arxiv.org/abs/2310.06298v2
- Date: Sat, 4 Nov 2023 08:51:44 GMT
- Title: Just-in-Time Flaky Test Detection via Abstracted Failure Symptom Matching
- Authors: Gabin An, Juyeon Yoon, Thomas Bach, Jingun Hong, Shin Yoo
- Abstract summary: We use failure symptoms to identify flaky test failures in a Continuous Integration pipeline for a large industrial software system, SAP HANA.
Our method shows the potential of using failure symptoms to identify recurring flaky failures, achieving a precision of at least 96%.
- Score: 11.677067576981075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We report our experience of using failure symptoms, such as error messages or
stack traces, to identify flaky test failures in a Continuous Integration (CI)
pipeline for a large industrial software system, SAP HANA. Although failure
symptoms are commonly used to identify similar failures, they have not
previously been employed to detect flaky test failures. Our hypothesis is that
flaky failures will exhibit symptoms distinct from those of non-flaky failures.
Consequently, we can identify recurring flaky failures, without rerunning the
tests, by matching the failure symptoms to those of historical flaky runs. This
can significantly reduce the need for test reruns, ultimately resulting in
faster delivery of test results to developers. To facilitate the process of
matching flaky failures across different execution instances, we abstract newer
test failure symptoms before matching them to the known patterns of flaky
failures, inspired by previous research in the fields of failure deduplication
and log analysis. We evaluate our symptom-based flakiness detection method
using actual failure symptoms gathered from CI data of SAP HANA during a
six-month period. Our method shows the potential of using failure symptoms to
identify recurring flaky failures, achieving a precision of at least 96%, while
saving approximately 58% of the machine time compared to the traditional rerun
strategy. Analysis of the false positives and the feedback from developers
underscore the importance of having descriptive and informative failure
symptoms for both the effective deployment of this symptom-based approach and
the debugging of flaky tests.
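The core loop described in the abstract — abstract a new failure symptom, then match it against the symptoms of historical flaky runs — can be sketched as follows. This is a minimal illustration only, assuming simple regex-based abstraction rules and an in-memory set of known flaky patterns; the actual abstraction rules and matching infrastructure used for SAP HANA are not described here, and all names (`abstract_symptom`, `is_known_flaky`, the example error message) are hypothetical.

```python
import re

# Minimal sketch of symptom abstraction and matching (illustrative only,
# not the authors' implementation). A raw failure symptom (error message
# or stack trace) is abstracted by masking run-specific tokens, and the
# result is matched against abstracted symptoms of known flaky failures.

# Hypothetical abstraction rules; order matters (timestamps before bare numbers).
_ABSTRACTION_RULES = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDR>"),                     # memory addresses
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T][\d:.]+"), "<TIMESTAMP>"),  # timestamps
    (re.compile(r"/[\w./-]+"), "<PATH>"),                          # file system paths
    (re.compile(r"\d+"), "<NUM>"),                                 # ports, durations, line numbers
]


def abstract_symptom(raw_symptom: str) -> str:
    """Replace volatile, run-specific tokens with stable placeholders."""
    abstracted = raw_symptom
    for pattern, placeholder in _ABSTRACTION_RULES:
        abstracted = pattern.sub(placeholder, abstracted)
    return abstracted


def is_known_flaky(raw_symptom: str, known_flaky_patterns: set[str]) -> bool:
    """True if the abstracted symptom matches a historical flaky pattern."""
    return abstract_symptom(raw_symptom) in known_flaky_patterns


# Example: the same timeout recurring with a different port and duration.
history = {abstract_symptom("TimeoutError: no response from 127.0.0.1:39412 after 300s")}
new_failure = "TimeoutError: no response from 127.0.0.1:41733 after 301s"
assert is_known_flaky(new_failure, history)  # recurring flaky failure, no rerun needed
```

In a CI setting along these lines, a failure whose abstracted symptom matches a historical flaky pattern could be reported as recurring flakiness immediately, while only unmatched failures would be escalated to reruns or developer triage.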
Related papers
- Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [55.17761802332469]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample.
Prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs for many applications.
We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z)
- 230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers [9.45325012281881]
Flaky tests are tests that can non-deterministically pass or fail, even in the absence of code changes.
How can we quickly determine whether a test failed due to flakiness or because it detected a real bug?
arXiv Detail & Related papers (2024-01-28T22:36:30Z)
- Semi-supervised learning via DQN for log anomaly detection [1.5339370927841764]
Current methods in log anomaly detection face challenges such as underutilization of unlabeled data, imbalance between normal and anomaly class data, and high rates of false positives and false negatives.
We propose a semi-supervised log anomaly detection method named DQNLog, which integrates deep reinforcement learning to enhance anomaly detection performance.
We evaluate DQNLog on three widely used datasets, demonstrating its ability to effectively utilize large-scale unlabeled data.
arXiv Detail & Related papers (2024-01-06T08:04:13Z)
- Test Generation Strategies for Building Failure Models and Explaining Spurious Failures [4.995172162560306]
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic.
We propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures.
We show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%.
arXiv Detail & Related papers (2023-12-09T18:36:15Z)
- PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows.
Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z)
- Are we certain it's anomalous? [57.729669157989235]
Anomaly detection in time series is a complex task: anomalies are rare, and temporal correlations are highly non-linear.
Here we propose the novel use of Hyperbolic uncertainty for Anomaly Detection (HypAD).
HypAD learns in a self-supervised manner to reconstruct the input signal.
arXiv Detail & Related papers (2022-11-16T21:31:39Z)
- Explainable Deep Few-shot Anomaly Detection with Deviation Networks [123.46611927225963]
We introduce a novel weakly-supervised anomaly detection framework to train detection models.
The proposed approach learns discriminative normality by leveraging the labeled anomalies and a prior probability.
Our model is substantially more sample-efficient and robust, and performs significantly better than state-of-the-art competing methods in both closed-set and open-set settings.
arXiv Detail & Related papers (2021-08-01T14:33:17Z)
- TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
- FaultFace: Deep Convolutional Generative Adversarial Network (DCGAN) based Ball-Bearing Failure Detection Method [4.543665832042712]
This paper proposes a methodology called FaultFace for failure detection on ball-bearing joints of rotational shafts.
A Deep Convolutional Generative Adversarial Network is employed to produce new faceportraits of the nominal and failure behaviors, yielding a balanced dataset.
arXiv Detail & Related papers (2020-07-30T06:37:53Z)
- Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm.
Results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.