A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks
- URL: http://arxiv.org/abs/2404.06203v3
- Date: Wed, 29 May 2024 06:22:16 GMT
- Title: A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks
- Authors: Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser
- Abstract summary: This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment.
Our results indicate that Flink is the most stable and achieves one of the best fault recovery times.
Kafka Streams shows performance instabilities after failures, while Spark Structured Streaming shows suitable fault recovery performance and stability, but with higher event latency.
- Score: 1.3398445165628463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, several software systems rely on stream processing architectures to deliver scalable performance and handle large volumes of data in near real-time. Stream processing frameworks facilitate scalable computing by distributing the application's execution across multiple machines. Although performance has been extensively studied, fault tolerance, a key feature offered by stream processing frameworks, has still not been measured properly with updated and comprehensive testbeds. Moreover, the impact that fault recovery can have on performance is mostly ignored. This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment with modern open-source frameworks, namely Flink, Kafka Streams, and Spark Structured Streaming. Our benchmarking analysis injects failures using an approach inspired by chaos engineering. Generally, our results indicate that much has changed compared to previous studies on fault recovery in distributed stream processing. In particular, the results indicate that Flink is the most stable and achieves one of the best fault recovery times. Moreover, Kafka Streams shows performance instabilities after failures, caused by its current rebalancing strategy, which can be suboptimal in terms of load balancing. Spark Structured Streaming shows suitable fault recovery performance and stability, but with higher event latency. Our study intends to (i) help industry practitioners choose the most suitable stream processing framework for efficient and reliable execution of data-intensive applications; (ii) support researchers in applying and extending our research method as well as our benchmark; and (iii) identify, prevent, and assist in solving potential issues in production deployments.
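The failure-injection methodology described in the abstract can be illustrated with a short sketch. The code below is not the authors' benchmark harness; it is a minimal Python illustration that assumes the streaming job runs on Kubernetes and exposes throughput through a Prometheus-style endpoint (the endpoint URL, metric query, namespace, and pod label are hypothetical). It shows the basic experiment loop: record a throughput baseline, delete a worker pod to emulate a crash, and time how long the pipeline needs to return close to the baseline.

```python
# Minimal sketch of a chaos-engineering-inspired fault-recovery experiment:
# kill one worker pod of a stream processing job and measure how long the
# pipeline needs to return to its pre-failure throughput. All names
# (endpoint, metric query, namespace, pod label) are illustrative
# assumptions, not the authors' actual harness.
import time
import requests
from kubernetes import client, config

PROMETHEUS = "http://prometheus.example:9090/api/v1/query"    # hypothetical endpoint
THROUGHPUT_QUERY = "sum(rate(records_processed_total[30s]))"  # hypothetical metric

def current_throughput() -> float:
    """Read the current end-to-end throughput from the metrics backend."""
    resp = requests.get(PROMETHEUS, params={"query": THROUGHPUT_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def inject_pod_failure(namespace: str, label_selector: str) -> str:
    """Delete the first pod matching the selector, emulating a process crash."""
    config.load_kube_config()
    core = client.CoreV1Api()
    pod = core.list_namespaced_pod(namespace, label_selector=label_selector).items[0]
    core.delete_namespaced_pod(pod.metadata.name, namespace, grace_period_seconds=0)
    return pod.metadata.name

def measure_recovery_time(baseline: float, tolerance: float = 0.95,
                          poll_s: int = 5, timeout_s: int = 600) -> float:
    """Seconds until throughput is back within `tolerance` of the baseline."""
    start = time.time()
    while time.time() - start < timeout_s:
        if current_throughput() >= tolerance * baseline:
            return time.time() - start
        time.sleep(poll_s)
    raise TimeoutError("pipeline did not recover within the timeout")

if __name__ == "__main__":
    baseline = current_throughput()
    victim = inject_pod_failure("streaming", "app=flink-taskmanager")  # assumed labels
    print(f"killed {victim}, baseline throughput {baseline:.0f} rec/s")
    print(f"recovery time: {measure_recovery_time(baseline):.1f} s")
```

The same loop can be pointed at a Kafka Streams or Spark Structured Streaming deployment by changing only the label selector and the metric query, which is what makes this style of experiment comparable across frameworks.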
Related papers
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite SeBS-Flow.
SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.
We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z) - Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models [12.581051275141537]
During software development, developers often make numerous modifications to the software to address existing issues or implement new features.
To ensure that the performance of new software releases does not degrade, existing practices rely on system-level performance testing.
We propose a novel approach to early detection of performance regressions by bridging the local performance data generated by component-level testing and the system-level architectural models.
arXiv Detail & Related papers (2024-08-15T13:33:20Z) - High-level Stream Processing: A Complementary Analysis of Fault Recovery [1.3398445165628463]
We focus on robust deployment setups inspired by requirements for near real-time analytics of a large cloud observability platform.
The results indicate significant potential for improving fault recovery and performance.
New abstractions for transparent configuration tuning are also needed for large-scale industry setups.
arXiv Detail & Related papers (2024-05-13T16:48:57Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks [1.4374467687356276]
This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks.
ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform.
Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
arXiv Detail & Related papers (2024-03-07T15:06:24Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications [7.850979932441607]
Pathway is a new unified data processing framework that can run workloads on both bounded and unbounded data streams.
We describe the system and present benchmarking results which demonstrate its capabilities in both batch and streaming contexts.
arXiv Detail & Related papers (2023-07-12T08:27:37Z) - FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations.
We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations.
To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
arXiv Detail & Related papers (2023-06-28T13:00:17Z) - Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z) - Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
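To make the key insight of the streaming perception entry above concrete, here is a minimal, hedged sketch (not the paper's official evaluation code): each ground-truth timestamp is scored against the most recent prediction that had already been emitted by that instant, so a slow but accurate detector is penalized for producing stale outputs. The `score` callback is a placeholder for any per-frame quality measure.

```python
# Sketch of the streaming-evaluation pairing idea from "Towards Streaming
# Perception": score each ground-truth frame against the latest prediction
# that had already been emitted by that time. Illustrative only; `score()`
# stands in for any per-frame quality measure (e.g. detection quality).
from bisect import bisect_right
from typing import Any, Callable, Sequence, Tuple

def streaming_pairs(gt: Sequence[Tuple[float, Any]],
                    preds: Sequence[Tuple[float, Any]]):
    """Yield (ground_truth, latest_available_prediction) pairs.

    gt:    (timestamp, annotation) sorted by timestamp
    preds: (emit_timestamp, output) sorted by emit timestamp
    """
    emit_times = [t for t, _ in preds]
    for t, annotation in gt:
        i = bisect_right(emit_times, t) - 1        # last prediction emitted <= t
        latest = preds[i][1] if i >= 0 else None   # None: nothing available yet
        yield annotation, latest

def streaming_score(gt, preds, score: Callable[[Any, Any], float]) -> float:
    """Average per-frame score under the streaming pairing rule."""
    pairs = list(streaming_pairs(gt, preds))
    return sum(score(a, p) for a, p in pairs) / max(len(pairs), 1)
```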