Related papers: High-level Stream Processing: A Complementary Analysis of Fault Recovery

High-level Stream Processing: A Complementary Analysis of Fault Recovery

URL: http://arxiv.org/abs/2405.07917v1
Date: Mon, 13 May 2024 16:48:57 GMT
Title: High-level Stream Processing: A Complementary Analysis of Fault Recovery
Authors: Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser,
Abstract summary: We focus on robust deployment setups inspired by requirements for near real-time analytics of a large cloud observability platform. The results indicate significant potential for improving fault recovery and performance. New abstractions for transparent configuration tuning are also needed for large-scale industry setups.
Score: 1.3398445165628463
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software architectural style. Several software systems rely on stream processing to deliver scalable performance, whereas open-source frameworks provide coding abstraction and high-level parallel computing. Although stream processing's performance is being extensively studied, the measurement of fault tolerance--a key abstraction offered by stream processing frameworks--has still not been adequately measured with comprehensive testbeds. In this work, we extend the previous fault recovery measurements with an exploratory analysis of the configuration space, additional experimental measurements, and analysis of improvement opportunities. We focus on robust deployment setups inspired by requirements for near real-time analytics of a large cloud observability platform. The results indicate significant potential for improving fault recovery and performance. However, these improvements entail grappling with configuration complexities, particularly in identifying and selecting the configurations to be fine-tuned and determining the appropriate values for them. Therefore, new abstractions for transparent configuration tuning are also needed for large-scale industry setups. We believe that more software engineering efforts are needed to provide insights into potential abstractions and how to achieve them. The stream processing community and industry practitioners could also benefit from more interactions with the high-level parallel programming community, whose expertise and insights on making parallel programming more productive and efficient could be extended.

Related papers

Training Report of TeleChat3-MoE [77.94641922160359]
This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes.<n>We detail systematic methodologies for operator-level and end-to-end numerical verification accuracy, ensuring consistency across hardware platforms.<n>A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations.
arXiv Detail & Related papers (2025-12-30T11:42:14Z)
Optimizing video analytics inference pipelines: a case study [3.4152678224558333]
This paper presents a comprehensive case study on optimizing a poultry welfare monitoring system.<n>We introduce a set of optimizations, including multi-level parallelization, optimizing code with substituting CPU code with GPU-accelerated code, vectorized clustering, and memory-efficient post-processing.
arXiv Detail & Related papers (2025-12-07T21:17:53Z)
FlashResearch: Real-time Agent Orchestration for Efficient Deep Research [62.03819662340356]
FlashResearch is a novel framework for efficient deep research.<n>It transforms sequential processing into parallel, runtime orchestration.<n>It can deliver up to a 5x speedup while maintaining comparable quality.
arXiv Detail & Related papers (2025-10-02T00:15:39Z)
Leveraging Neural Graph Compilers in Machine Learning Research for Edge-Cloud Systems [5.241450170761232]
This work presents a comprehensive evaluation of neural network graph compilers across heterogeneous hardware platforms. Our systematic analysis reveals that graph compilers exhibit performance patterns highly dependent on both neural architecture and batch sizes. We introduce novel metrics to quantify a compiler's ability to mitigate performance friction as batch size increases.
arXiv Detail & Related papers (2025-04-28T19:02:16Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Fast and Efficient What-If Analyses of Invocation Overhead and Transactional Boundaries to Support the Migration to Microservices [0.3222802562733786]
Microservice architecture improves agility and maintainability of software systems. Decomposing existing software into out-of-process components can have a severe impact on non-functional properties. What-if analyses allow to explore different scenarios and to develop the service boundaries in an iterative and incremental way.
arXiv Detail & Related papers (2025-01-30T09:42:56Z)
Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based Automated Requirement Traceability and Compliance Checks [8.354305051472735]
This study demonstrates that integrating a robust Graph-RAG framework with advanced prompt engineering techniques, such as Chain of Thought and Tree of Thought, can significantly enhance performance. It is both costly and more complex to implement across diverse contexts, requiring careful adaptation to specific scenarios.
arXiv Detail & Related papers (2024-12-11T18:11:39Z)
The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving [8.552242818726347]
INFERMAX is an analytical framework that uses inference cost models to compare various schedulers. Our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemptions at all.
arXiv Detail & Related papers (2024-11-12T00:10:34Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks [1.3398445165628463]
This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment. Our results indicate that Flink is the most stable and has one of the best fault recovery. K Kafka Streams shows suitable fault recovery performance and stability, but with higher event latency.
arXiv Detail & Related papers (2024-04-09T10:49:23Z)
Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis. We find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
A Microservices Identification Method Based on Spectral Clustering for Industrial Legacy Systems [5.255685751491305]
We propose an automated microservice decomposition method for extracting microservice candidates based on spectral graph theory. We show that our method can yield favorable results even without the involvement of domain experts.
arXiv Detail & Related papers (2023-12-20T07:47:01Z)
Can LLMs Configure Software Tools [0.76146285961466]
In software engineering, the meticulous configuration of software tools is crucial in ensuring optimal performance within intricate systems. In this study, we embark on an exploration of leveraging Large-Language Models (LLMs) to streamline the software configuration process. Our work presents a novel approach that employs LLMs, such as Chat-GPT, to identify starting conditions and narrow down the search space, improving configuration efficiency.
arXiv Detail & Related papers (2023-12-11T05:03:02Z)
FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations. We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations. To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
arXiv Detail & Related papers (2023-06-28T13:00:17Z)
Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review [62.997667081978825]
This review aims at providing a comprehensive vision of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today. The main simulation, emulation, deployment systems, and testbeds for experimental research on the Edge-to-Cloud Continuum available today are also surveyed.
arXiv Detail & Related papers (2022-04-29T08:06:05Z)
Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines. We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.