A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks
- URL: http://arxiv.org/abs/2404.06203v3
- Date: Wed, 29 May 2024 06:22:16 GMT
- Title: A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks
- Authors: Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser
- Abstract summary: This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment.
Our results indicate that Flink is the most stable and achieves one of the best fault recovery times.
Kafka Streams shows performance instabilities after failures, while Spark Structured Streaming shows suitable fault recovery performance and stability, but with higher event latency.
- Score: 1.3398445165628463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, several software systems rely on stream processing architectures to deliver scalable performance and handle large volumes of data in near real-time. Stream processing frameworks facilitate scalable computing by distributing the application's execution across multiple machines. Although performance has been extensively studied, fault tolerance, a key feature offered by stream processing frameworks, has still not been measured properly with updated and comprehensive testbeds. Moreover, the impact that fault recovery can have on performance is mostly ignored. This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment with modern open-source frameworks, namely Flink, Kafka Streams, and Spark Structured Streaming. Our benchmarking analysis injects failures using an approach inspired by chaos engineering. Generally, our results indicate that much has changed compared to previous studies on fault recovery in distributed stream processing. In particular, the results indicate that Flink is the most stable and achieves one of the best fault recovery times. Moreover, Kafka Streams shows performance instabilities after failures, caused by its current rebalancing strategy, which can be suboptimal in terms of load balancing. Spark Structured Streaming shows suitable fault recovery performance and stability, but with higher event latency. Our study intends to (i) help industry practitioners choose the most suitable stream processing framework for efficient and reliable execution of data-intensive applications; (ii) support researchers in applying and extending our research method as well as our benchmark; and (iii) identify, prevent, and assist in solving potential issues in production deployments.
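The failure-injection methodology described in the abstract can be illustrated with a short sketch. The code below is not the authors' benchmark harness; it is a minimal Python illustration that assumes the streaming job runs on Kubernetes and exposes throughput through a Prometheus-style endpoint (the endpoint URL, metric query, namespace, and pod label are hypothetical). It shows the basic experiment loop: record a throughput baseline, delete a worker pod to emulate a crash, and time how long the pipeline needs to return close to the baseline.

```python
# Minimal sketch of a chaos-engineering-inspired fault-recovery experiment:
# kill one worker pod of a stream processing job and measure how long the
# pipeline needs to return to its pre-failure throughput. All names
# (endpoint, metric query, namespace, pod label) are illustrative
# assumptions, not the authors' actual harness.
import time
import requests
from kubernetes import client, config

PROMETHEUS = "http://prometheus.example:9090/api/v1/query"    # hypothetical endpoint
THROUGHPUT_QUERY = "sum(rate(records_processed_total[30s]))"  # hypothetical metric

def current_throughput() -> float:
    """Read the current end-to-end throughput from the metrics backend."""
    resp = requests.get(PROMETHEUS, params={"query": THROUGHPUT_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def inject_pod_failure(namespace: str, label_selector: str) -> str:
    """Delete the first pod matching the selector, emulating a process crash."""
    config.load_kube_config()
    core = client.CoreV1Api()
    pod = core.list_namespaced_pod(namespace, label_selector=label_selector).items[0]
    core.delete_namespaced_pod(pod.metadata.name, namespace, grace_period_seconds=0)
    return pod.metadata.name

def measure_recovery_time(baseline: float, tolerance: float = 0.95,
                          poll_s: int = 5, timeout_s: int = 600) -> float:
    """Seconds until throughput is back within `tolerance` of the baseline."""
    start = time.time()
    while time.time() - start < timeout_s:
        if current_throughput() >= tolerance * baseline:
            return time.time() - start
        time.sleep(poll_s)
    raise TimeoutError("pipeline did not recover within the timeout")

if __name__ == "__main__":
    baseline = current_throughput()
    victim = inject_pod_failure("streaming", "app=flink-taskmanager")  # assumed labels
    print(f"killed {victim}, baseline throughput {baseline:.0f} rec/s")
    print(f"recovery time: {measure_recovery_time(baseline):.1f} s")
```

The same loop can be pointed at a Kafka Streams or Spark Structured Streaming deployment by changing only the label selector and the metric query, which is what makes this style of experiment comparable across frameworks.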
Related papers
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite SeBS-Flow.
SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.
We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z) - Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models [12.581051275141537]
During software development, developers often make numerous modifications to the software to address existing issues or implement new features.
To ensure that the performance of new software releases does not degrade, existing practices rely on system-level performance testing.
We propose a novel approach to early detection of performance regressions by bridging the local performance data generated by component-level testing and the system-level architectural models.
arXiv Detail & Related papers (2024-08-15T13:33:20Z) - High-level Stream Processing: A Complementary Analysis of Fault Recovery [1.3398445165628463]
We focus on robust deployment setups inspired by requirements for near real-time analytics of a large cloud observability platform.
The results indicate significant potential for improving fault recovery and performance.
New abstractions for transparent configuration tuning are also needed for large-scale industry setups.
arXiv Detail & Related papers (2024-05-13T16:48:57Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks [1.4374467687356276]
This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks.
ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform.
Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
arXiv Detail & Related papers (2024-03-07T15:06:24Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications [7.850979932441607]
Pathway is a new unified data processing framework that can run workloads on both bounded and unbounded data streams.
We describe the system and present benchmarking results which demonstrate its capabilities in both batch and streaming contexts.
arXiv Detail & Related papers (2023-07-12T08:27:37Z) - FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations.
We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations.
To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
arXiv Detail & Related papers (2023-06-28T13:00:17Z) - Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z) - Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
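To make the key insight of the streaming perception entry above concrete, here is a minimal, hedged sketch (not the paper's official evaluation code): each ground-truth timestamp is scored against the most recent prediction that had already been emitted by that instant, so a slow but accurate detector is penalized for producing stale outputs. The `score` callback is a placeholder for any per-frame quality measure.

```python
# Sketch of the streaming-evaluation pairing idea from "Towards Streaming
# Perception": score each ground-truth frame against the latest prediction
# that had already been emitted by that time. Illustrative only; `score()`
# stands in for any per-frame quality measure (e.g. detection quality).
from bisect import bisect_right
from typing import Any, Callable, Sequence, Tuple

def streaming_pairs(gt: Sequence[Tuple[float, Any]],
                    preds: Sequence[Tuple[float, Any]]):
    """Yield (ground_truth, latest_available_prediction) pairs.

    gt:    (timestamp, annotation) sorted by timestamp
    preds: (emit_timestamp, output) sorted by emit timestamp
    """
    emit_times = [t for t, _ in preds]
    for t, annotation in gt:
        i = bisect_right(emit_times, t) - 1        # last prediction emitted <= t
        latest = preds[i][1] if i >= 0 else None   # None: nothing available yet
        yield annotation, latest

def streaming_score(gt, preds, score: Callable[[Any, Any], float]) -> float:
    """Average per-frame score under the streaming pairing rule."""
    pairs = list(streaming_pairs(gt, preds))
    return sum(score(a, p) for a, p in pairs) / max(len(pairs), 1)
```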