When Should I Run My Application Benchmark?: Studying Cloud Performance Variability for the Case of Stream Processing Applications
- URL: http://arxiv.org/abs/2504.11826v1
- Date: Wed, 16 Apr 2025 07:22:44 GMT
- Title: When Should I Run My Application Benchmark?: Studying Cloud Performance Variability for the Case of Stream Processing Applications
- Authors: Sören Henning, Adriano Vogel, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser
- Abstract summary: This paper empirically quantifies the impact of cloud performance variability on benchmarking results. With approximately 591 hours of experiments, deploying 789 Kubernetes clusters on AWS and executing 2366 benchmarks, this is likely the largest study of its kind.
- Score: 1.3398445165628463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance benchmarking is a common practice in software engineering, particularly when building large-scale, distributed, and data-intensive systems. While cloud environments offer several advantages for running benchmarks, it is often reported that benchmark results can vary significantly between repetitions -- making it difficult to draw reliable conclusions about real-world performance. In this paper, we empirically quantify the impact of cloud performance variability on benchmarking results, focusing on stream processing applications as a representative type of data-intensive, performance-critical system. In a longitudinal study spanning more than three months, we repeatedly executed an application benchmark used in research and development at Dynatrace. This allows us to assess various aspects of performance variability, particularly concerning temporal effects. With approximately 591 hours of experiments, deploying 789 Kubernetes clusters on AWS and executing 2366 benchmarks, this is likely the largest study of its kind and the only one addressing performance from an end-to-end, i.e., application benchmark perspective. Our study confirms that performance variability exists, but it is less pronounced than often assumed (coefficient of variation of < 3.7%). Unlike related studies, we find that performance does exhibit a daily and weekly pattern, although with only small variability (<= 2.5%). Re-using benchmarking infrastructure across multiple repetitions introduces only a slight reduction in result accuracy (<= 2.5 percentage points). These key observations hold consistently across different cloud regions and machine types with different processor architectures. We conclude that for engineers and researchers focused on detecting substantial performance differences (e.g., > 5%) in...
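As a rough illustration of the statistics discussed in the abstract, below is a minimal Python sketch of how the coefficient of variation (CV) of repeated benchmark runs can be computed and compared against a candidate performance difference. The sample values, the helper function, and the 2×CV rule of thumb are illustrative assumptions, not the paper's actual data or analysis pipeline.

```python
# Minimal sketch: quantify run-to-run variability of repeated benchmark results
# and compare it against a candidate performance difference.
# All measurements below are made up for illustration.

import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean, expressed as a percentage."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return 100.0 * stdev / mean

# Hypothetical throughput results (e.g., records/s) from repeated runs of the
# same configuration on freshly provisioned clusters.
baseline_runs = [98_400, 101_200, 99_800, 100_500, 99_100]
candidate_runs = [93_900, 94_800, 94_200, 95_100, 93_600]

cv_baseline = coefficient_of_variation(baseline_runs)
cv_candidate = coefficient_of_variation(candidate_runs)

# Relative difference between the two configurations' mean performance.
diff_pct = 100.0 * (statistics.mean(candidate_runs) - statistics.mean(baseline_runs)) \
           / statistics.mean(baseline_runs)

print(f"baseline CV:  {cv_baseline:.2f}%")
print(f"candidate CV: {cv_candidate:.2f}%")
print(f"mean difference: {diff_pct:+.2f}%")

# Rough rule of thumb in the spirit of the paper's conclusion: if the observed
# difference (e.g., > 5%) clearly exceeds the run-to-run variability
# (CV < ~3.7% in the study), few repetitions are needed to detect it.
if abs(diff_pct) > 5.0 and abs(diff_pct) > 2 * max(cv_baseline, cv_candidate):
    print("Difference likely exceeds cloud performance variability.")
else:
    print("Difference may be within normal variability; more repetitions advised.")
```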
Related papers
- Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope? [39.78423267310698]
The application of reinforcement learning to dynamic resource allocation in optical networks has been the focus of intense research activity in recent years. We present a review of progress in the field, and identify significant gaps in benchmarking practices and solutions.
arXiv Detail & Related papers (2025-02-18T12:09:42Z) - SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite, SeBS-Flow. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z) - Green AI: A Preliminary Empirical Study on Energy Consumption in DL
Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which serve as standard formats for exchanging models across different runtime infrastructures.
arXiv Detail & Related papers (2024-02-21T09:18:44Z) - Rethinking Few-Shot Object Detection on a Multi-Domain Benchmark [28.818423712485504]
The Multi-dOmain Few-Shot Object Detection (MoFSOD) benchmark consists of 10 datasets from a wide range of domains.
We analyze the impacts of freezing layers, different architectures, and different pre-training datasets on FSOD performance.
arXiv Detail & Related papers (2022-07-22T16:13:22Z) - Benchopt: Reproducible, efficient and collaborative optimization
benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z) - Analyzing the Impact of Undersampling on the Benchmarking and
Configuration of Evolutionary Algorithms [3.967483941966979]
We show that care should be taken when making decisions based on limited data.
We show examples of performance losses of more than 20%, even when using statistical races to dynamically adjust the number of runs.
arXiv Detail & Related papers (2022-04-20T09:53:59Z) - ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through
Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance for transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z) - Multi-Domain Joint Training for Person Re-Identification [51.73921349603597]
Deep learning-based person Re-IDentification (ReID) often requires a large amount of training data to achieve good performance.
It appears that collecting more training data from diverse environments tends to improve the ReID performance.
We propose an approach called Domain-Camera-Sample Dynamic network (DCSD) whose parameters can be adaptive to various factors.
arXiv Detail & Related papers (2022-01-06T09:20:59Z) - DAPPER: Label-Free Performance Estimation after Personalization for
Heterogeneous Mobile Sensing [95.18236298557721]
We present DAPPER (Domain AdaPtation Performance EstimatoR) that estimates the adaptation performance in a target domain with unlabeled target data.
Our evaluation with four real-world sensing datasets compared against six baselines shows that DAPPER outperforms the state-of-the-art baseline by 39.8% in estimation accuracy.
arXiv Detail & Related papers (2021-11-22T08:49:33Z) - Benchmarking and Performance Modelling of MapReduce Communication
Pattern [0.0]
Models can be used to infer the performance of unseen applications and approximate their performance when an arbitrary dataset is used as input (a minimal illustrative sketch of this idea follows the list below).
Our approach is validated by running empirical experiments in two setups.
arXiv Detail & Related papers (2020-05-23T21:52:29Z)
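Related to the MapReduce performance-modelling entry above, the following is a purely illustrative sketch of the general idea of a benchmark-derived performance model: fit runtime against input size from a few measured runs, then extrapolate to an unseen dataset size. The linear-model form, the fit_linear helper, and all numbers are assumptions for illustration, not that paper's actual model.

```python
# Illustrative sketch: fit a simple performance model from benchmark measurements
# and use it to approximate the runtime for an unseen input size.
# The linear form and all numbers are hypothetical.

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical measurements: input size in GB vs. observed job runtime in seconds.
sizes_gb = [10, 20, 40, 80]
runtimes_s = [62, 118, 231, 455]

a, b = fit_linear(sizes_gb, runtimes_s)
predicted = a + b * 160  # approximate runtime for an unseen 160 GB input
print(f"runtime(GB) ~ {a:.1f} + {b:.2f} * GB; predicted for 160 GB: {predicted:.0f} s")
```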