The Vocabulary of Flaky Tests in the Context of SAP HANA
- URL: http://arxiv.org/abs/2602.23957v1
- Date: Fri, 27 Feb 2026 11:59:23 GMT
- Title: The Vocabulary of Flaky Tests in the Context of SAP HANA
- Authors: Alexander Berndt, Zoltán Nochta, Thomas Bach
- Abstract summary: Flaky tests fail seemingly at random without changes to the code. Previous work proposed to identify flaky tests based on source code identifiers in the test code. We evaluate approaches to identify flaky tests and their root causes based on source code identifiers in the test code in a large-scale industrial project.
- Score: 43.04215607079248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background. Automated test execution is an important activity to gather information about the quality of a software project. So-called flaky tests, however, negatively affect this process. Such tests fail seemingly at random without changes to the code and thus do not provide a clear signal. Previous work proposed to identify flaky tests based on the source code identifiers in the test code. So far, these approaches have not been evaluated in a large-scale industrial setting. Aims. We evaluate approaches to identify flaky tests and their root causes based on source code identifiers in the test code in a large-scale industrial project. Method. First, we replicate previous work by Pinto et al. in the context of SAP HANA. Second, we assess different feature extraction techniques, namely TF-IDF and TF-IDFC-RF. Third, we evaluate CodeBERT and XGBoost as classification models. For a sound comparison, we utilize both the data set from previous work and two data sets from SAP HANA. Results. Our replication shows similar results on the original data set and on one of the SAP HANA data sets. While the original approach yielded an F1-Score of 0.94 on the original data set and 0.92 on the SAP HANA data set, our extensions achieve F1-Scores of 0.96 and 0.99, respectively. The reliance on external data sources is a common root cause for test flakiness in the context of SAP HANA. Conclusions. The vocabulary of a large industrial project seems to be slightly different with respect to the exact terms, but the categories for the terms, such as remote dependencies, are similar to previous empirical findings. However, even with rather large F1-Scores, both finding source code identifiers for flakiness and a black box prediction have limited use in practice as the results are not actionable for developers.
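The abstract describes classifying flaky tests from the vocabulary of the test code: extract source code identifiers, turn them into TF-IDF features, and feed those features to a classifier such as XGBoost. As a minimal sketch of that feature-extraction idea (not the paper's actual implementation, and using only the standard library rather than a real TF-IDF/XGBoost toolchain), identifier splitting and plain TF-IDF can look like this:

```python
# Hypothetical sketch of the vocabulary-based feature pipeline:
# split test code into identifier sub-tokens, then weight them with TF-IDF.
import math
import re
from collections import Counter


def identifier_tokens(test_code: str) -> list[str]:
    """Extract identifiers and break them into lower-cased sub-tokens
    (both snake_case and camelCase are split apart)."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", test_code)
    tokens = []
    for word in words:
        for part in word.split("_"):
            # "waitForServer" -> ["wait", "For", "Server"]
            tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in tokens if t]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Plain TF-IDF weighting: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


# Toy corpus: terms like "sleep" or "remote" are the kind of vocabulary
# prior work associates with flakiness (e.g. timing and remote dependencies).
corpus = [
    "def test_remote_fetch(): time.sleep(5); assertServerReachable(url)",
    "def test_add(): assert add(2, 3) == 5",
]
vecs = tfidf([identifier_tokens(code) for code in corpus])
```

The resulting sparse vectors would then be the input to a classifier; terms appearing in every test (like `def` or `test`) receive zero weight, while distinctive terms such as `sleep` or `remote` stand out.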
Related papers
- Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study [40.93176986225226]
Flaky tests yield inconsistent results when they are repeatedly executed on the same code revision. Previous work evaluated approaches to train machine learning models to classify flaky tests based on identifiers in the test code.
arXiv Detail & Related papers (2026-02-05T09:15:09Z) - Flaky Tests in a Large Industrial Database Management System: An Empirical Study of Fixed Issue Reports for SAP HANA [45.467566253448666]
Flaky tests yield different results when executed multiple times for the same version of the source code. A variety of factors can cause test flakiness. Approaches to fix flaky tests are typically tailored to address specific causes.
arXiv Detail & Related papers (2026-02-03T14:03:59Z) - Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z) - Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
API providers may quantize, watermark, or finetune the underlying model, changing the output distribution. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem. A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
arXiv Detail & Related papers (2024-10-26T18:34:53Z) - Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z) - FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair [0.5749787074942512]
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test.
In this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis.
One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category.
arXiv Detail & Related papers (2023-06-21T19:34:16Z) - Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs)
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z) - Labeling-Free Comparison Testing of Deep Learning Models [28.47632100019289]
We propose a labeling-free comparison testing approach to overcome the limitations of labeling effort and sampling randomness.
Our approach outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's $\tau$, regardless of the dataset and distribution shift.
arXiv Detail & Related papers (2022-04-08T10:55:45Z) - What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness.
We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.