Flaky Tests in a Large Industrial Database Management System: An Empirical Study of Fixed Issue Reports for SAP HANA
- URL: http://arxiv.org/abs/2602.03556v1
- Date: Tue, 03 Feb 2026 14:03:59 GMT
- Title: Flaky Tests in a Large Industrial Database Management System: An Empirical Study of Fixed Issue Reports for SAP HANA
- Authors: Alexander Berndt, Thomas Bach, Sebastian Baltes
- Abstract summary: Flaky tests yield different results when executed multiple times for the same version of the source code. A variety of factors can cause test flakiness. Approaches to fix flaky tests are typically tailored to address specific causes.
- Score: 45.467566253448666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Flaky tests yield different results when executed multiple times for the same version of the source code. Thus, they provide an ambiguous signal about the quality of the code and interfere with the automated assessment of code changes. While a variety of factors can cause test flakiness, approaches to fix flaky tests are typically tailored to address specific causes. However, the prevalent root causes of flaky tests can vary depending on the programming language, application domain, or size of the software project. Since manually labeling flaky tests is time-consuming and tedious, this work proposes an LLMs-as-annotators approach that leverages intra- and inter-model consistency to label issue reports related to fixed flakiness issues with the relevant root cause category. This allows us to gain an overview of prevalent flakiness categories in the issue reports. We evaluated our labeling approach in the context of SAP HANA, a large industrial database management system. Our results suggest that SAP HANA's tests most commonly suffer from issues related to concurrency (23%, 130 of 559 analyzed issue reports). Moreover, our results suggest that different test types face different flakiness challenges. Therefore, we encourage future research on flakiness mitigation to consider evaluating the generalizability of proposed approaches across different test types.
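The abstract names intra- and inter-model consistency as the basis for labeling but does not spell out the procedure here. The following is a minimal sketch of one way such a filter could work, not the authors' implementation: the function names, the `min_share` threshold, the model names, and the sample votes are all hypothetical. A label is accepted only when each model agrees with itself across repeated runs (intra-model) and all models agree with each other (inter-model). The last line simply checks the reported share of concurrency issues (130 of 559).

```python
from collections import Counter


def intra_model_label(samples, min_share=1.0):
    """Return the majority label of one model's repeated runs if its share
    reaches min_share (hypothetical threshold), otherwise None."""
    if not samples:
        return None
    label, count = Counter(samples).most_common(1)[0]
    return label if count / len(samples) >= min_share else None


def consistent_label(samples_by_model, min_share=1.0):
    """Accept a label only if every model is self-consistent (intra-model)
    and all models settle on the same label (inter-model)."""
    per_model = [intra_model_label(s, min_share) for s in samples_by_model.values()]
    if not per_model or any(label is None for label in per_model):
        return None
    return per_model[0] if len(set(per_model)) == 1 else None


# Hypothetical votes for one issue report: two models, three runs each.
votes = {
    "model_a": ["concurrency", "concurrency", "concurrency"],
    "model_b": ["concurrency", "concurrency", "concurrency"],
}
print(consistent_label(votes))

# Share of concurrency-related reports stated in the abstract: 130 of 559.
print(round(100 * 130 / 559))
```

Reports for which the function returns None would be left unlabeled or passed to a human annotator in this sketch; how the study treats such cases is not stated in the abstract.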
Related papers
- The Vocabulary of Flaky Tests in the Context of SAP HANA [43.04215607079248]
Flaky tests fail seemingly at random without changes to the code.
Previous work proposed to identify flaky tests based on source code identifiers in the test code.
We evaluate approaches to identify flaky tests and their root causes based on source code identifiers in the test code in a large-scale industrial project.
arXiv Detail & Related papers (2026-02-27T11:59:23Z) - Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study [40.93176986225226]
Flaky tests yield inconsistent results when they are repeatedly executed on the same code revision.
Previous work evaluated approaches to train machine learning models to classify flaky tests based on identifiers in the test code.
arXiv Detail & Related papers (2026-02-05T09:15:09Z) - On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems [42.98432295929164]
Flaky tests yield inconsistent results when executed multiple times on the same code.
Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests.
Our study informs developers on what types of flakiness to expect from LLM-generated tests.
arXiv Detail & Related papers (2026-01-13T21:48:28Z) - Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures [6.824747267214373]
Flaky tests produce inconsistent outcomes without code changes.
Developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250.
We show that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness.
arXiv Detail & Related papers (2025-04-23T14:51:23Z) - Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
API providers may quantize, watermark, or finetune the underlying model, changing the output distribution.
We formalize detecting such distortions by Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
arXiv Detail & Related papers (2024-10-26T18:34:53Z) - Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z) - FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests [3.0846824529023382]
Flaky tests can pass or fail non-deterministically, without alterations to a software system.
State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy.
arXiv Detail & Related papers (2024-03-01T22:00:44Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
Test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z) - Using Metamorphic Relations to Verify and Enhance Artcode Classification [39.36253474867746]
An example of an area facing the oracle problem is automatic image classification, using machine learning to classify an input image as one of a set of predefined classes.
An approach to software testing that alleviates the oracle problem is metamorphic testing (MT).
This paper examines the problem of classifying images containing visually hidden markers called Artcodes, and applies MT to verify and enhance the trained classifiers.
arXiv Detail & Related papers (2021-08-05T15:54:56Z) - What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness.
We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)