Evaluation of Load Prediction Techniques for Distributed Stream
Processing
- URL: http://arxiv.org/abs/2108.04749v1
- Date: Tue, 10 Aug 2021 15:25:32 GMT
- Title: Evaluation of Load Prediction Techniques for Distributed Stream
Processing
- Authors: Kordian Gontarska, Morgan Geldenhuys, Dominik Scheinert, Philipp
Wiesner, Andreas Polze, Lauritz Thamsen
- Abstract summary: Distributed Stream Processing (DSP) systems enable processing large streams of continuous data to produce results in near real time.
The rate at which events arrive at DSP systems can vary considerably over time.
A priori knowledge of incoming workloads enables proactive approaches to resource management and optimization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed Stream Processing (DSP) systems enable processing large streams
of continuous data to produce results in near real time. They are an
essential part of many data-intensive applications and analytics platforms. The
rate at which events arrive at DSP systems can vary considerably over time,
which may be due to trends as well as cyclic and seasonal patterns within the
data streams. A priori knowledge of incoming workloads enables proactive
approaches to resource management and optimization tasks such as dynamic
scaling, live migration of resources, and the tuning of configuration
parameters at runtime, thus leading to a potentially better Quality of Service.
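To make the dynamic-scaling use-case concrete, here is a minimal sketch of how a load forecast could drive a proactive scaling decision; the capacity and margin constants are illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical numbers for illustration; real values depend on the job and cluster.
PER_WORKER_CAPACITY = 5_000   # events/sec one parallel instance can sustain (assumed)
SAFETY_MARGIN = 1.2           # headroom so short bursts do not cause lag

def target_parallelism(predicted_rate: float) -> int:
    """Map a predicted arrival rate (events/sec) to a worker count."""
    return max(1, math.ceil(predicted_rate * SAFETY_MARGIN / PER_WORKER_CAPACITY))

# A forecast of 42,000 events/sec would proactively scale to 11 workers.
print(target_parallelism(42_000))  # -> 11
```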
In this paper, we conduct a comprehensive evaluation of different load
prediction techniques for DSP jobs. We identify three use-cases and formulate
requirements for making load predictions specific to DSP jobs. Automatically
optimized classical and Deep Learning methods are evaluated on nine
different datasets from typical DSP domains, i.e., IoT, Web 2.0, and cluster
monitoring. We compare model performance with respect to overall accuracy and
training duration. Our results show that the Deep Learning methods provide the
most accurate load predictions for the majority of the evaluated datasets.
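As an illustration of the kind of comparison described above, the following sketch pits a classical forecaster against a small neural network (standing in for the Deep Learning models) on a synthetic event-rate series and reports both error and training time. The data, window size, and model settings are assumptions for demonstration only.

```python
import time
import numpy as np
from sklearn.neural_network import MLPRegressor
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
t = np.arange(24 * 60)                                  # 60 "days" of hourly rates
y = 100 + 0.05 * t + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, t.size)
train, test = y[:-24], y[-24:]                          # predict the last "day"

# Classical baseline: Holt-Winters with additive trend and daily seasonality.
start = time.perf_counter()
hw = ExponentialSmoothing(train, trend="add", seasonal="add",
                          seasonal_periods=24).fit()
hw_time = time.perf_counter() - start
hw_mae = np.abs(hw.forecast(24) - test).mean()

# Neural stand-in: an MLP trained on sliding windows of 24 past observations.
window = 24
X = np.lib.stride_tricks.sliding_window_view(train[:-1], window)
targets = train[window:]
start = time.perf_counter()
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                   random_state=0).fit(X, targets)
mlp_time = time.perf_counter() - start

# Roll the MLP forward one step at a time over the test horizon.
history = list(train[-window:])
preds = []
for _ in range(24):
    nxt = mlp.predict(np.asarray(history[-window:]).reshape(1, -1))[0]
    preds.append(nxt)
    history.append(nxt)
mlp_mae = np.abs(np.asarray(preds) - test).mean()

print(f"Holt-Winters: MAE={hw_mae:.2f}, fit in {hw_time:.2f}s")
print(f"MLP:          MAE={mlp_mae:.2f}, fit in {mlp_time:.2f}s")
```

This accuracy-versus-training-time trade-off is what the paper quantifies across nine datasets.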
Related papers
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
- Advancing Enterprise Spatio-Temporal Forecasting Applications: Data Mining Meets Instruction Tuning of Language Models For Multi-modal Time Series Analysis in Low-Resource Settings [0.0]
Spatio-temporal forecasting is crucial in transportation, logistics, and supply chain management.
We propose a dynamic, multi-modal approach that integrates the strengths of traditional forecasting methods and instruction tuning of small language models.
Our framework enables on-premises customization with reduced computational and memory demands, while maintaining inference speed and data privacy/security.
arXiv Detail & Related papers (2024-08-24T16:32:58Z)
- Rethinking Resource Management in Edge Learning: A Joint Pre-training and Fine-tuning Design Paradigm [87.47506806135746]
In some applications, edge learning is experiencing a shift in focus from conventional learning from scratch to a new two-stage learning paradigm.
This paper considers the problem of joint communication and computation resource management in a two-stage edge learning system.
It is shown that the proposed joint resource management over the pre-training and fine-tuning stages well balances the system performance trade-off.
arXiv Detail & Related papers (2024-04-01T00:21:11Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exhibits the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files.
We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices.
We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time.
Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z)
- On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds [0.0]
We propose a collaborative approach for sharing anonymized workload execution traces among users.
We mine them for general patterns and exploit clusters of historical workloads for future optimizations (a toy clustering sketch follows).
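As a hedged illustration (not the paper's exact method), workload traces reduced to a few numeric features can be grouped with an off-the-shelf clustering algorithm, so that a new job can inherit tuned settings from similar historical runs; the features and values below are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-trace features: [runtime_s, input_gb, peak_mem_gb, cpu_util]
traces = np.array([
    [1200, 50, 16, 0.80],
    [1150, 48, 15, 0.70],
    [7200, 400, 64, 0.90],
    [6900, 380, 60, 0.95],
    [300, 5, 4, 0.30],
])

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(traces)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # traces in the same cluster can share tuned configurations
```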
arXiv Detail & Related papers (2021-11-16T20:11:36Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain predictive accuracy over time (a minimal drift-detection sketch follows).
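As a minimal, self-contained illustration of such drift detection, here is a from-scratch Page-Hinkley test, one of the standard detectors such tools can employ; the parameters are illustrative assumptions.

```python
import random

class PageHinkley:
    """Detects an upward shift in the mean of a stream (Page-Hinkley test)."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0):
        self.delta, self.threshold = delta, threshold
        self.mean, self.n, self.cum, self.min_cum = 0.0, 0, 0.0, 0.0

    def update(self, x: float) -> bool:
        """Feed one observation; return True if drift is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n          # running mean
        self.cum += x - self.mean - self.delta         # cumulative deviation
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold

random.seed(0)
detector = PageHinkley()
# Stationary stream, then an abrupt mean shift (simulated concept drift).
stream = [random.gauss(0, 1) for _ in range(500)] + \
         [random.gauss(3, 1) for _ in range(500)]
for i, x in enumerate(stream):
    if detector.update(x):
        print(f"drift detected at index {i}")
        break
```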
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- One Backward from Ten Forward, Subsampling for Large-Scale Deep Learning [35.0157090322113]
Large-scale machine learning systems are often continuously trained with enormous data from production environments.
The sheer volume of streaming data poses a significant challenge to real-time training subsystems and ad-hoc sampling is the standard practice.
We propose to record a constant amount of information per instance from these forward passes. The extra information measurably improves the selection of which data instances should participate in forward and backward passes (a loss-based selection sketch follows).
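A hedged sketch of this idea in the spirit of loss-based selective backpropagation (not necessarily the paper's exact recording scheme): run the forward pass on a full batch, keep per-example losses as the recorded information, and backpropagate only the hardest examples.

```python
import torch
import torch.nn as nn

# Toy model and data; shapes and hyperparameters are illustrative only.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss(reduction="none")              # keep per-example losses

x, y = torch.randn(256, 10), torch.randn(256, 1)

per_example = loss_fn(model(x), y).squeeze(1)       # forward pass on everything
k = 32                                              # backprop ~1 in 8 examples
hardest = torch.topk(per_example, k).indices        # the recorded "information"

optimizer.zero_grad()
per_example[hardest].mean().backward()              # backward on the subset only
optimizer.step()
```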
arXiv Detail & Related papers (2021-04-27T11:29:02Z)
- Online feature selection for rapid, low-overhead learning in networked systems [0.0]
We present an online algorithm, called OSFS, that selects a small feature set from a large number of available data sources.
We find that OSFS requires several hundred measurements to reduce the number of data sources by two orders of magnitude (a generic online-selection sketch follows the entry).
arXiv Detail & Related papers (2020-10-28T12:00:42Z)
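As a generic illustration of online feature selection (not the OSFS algorithm itself), one can score each data source by the correlation of its recent samples with the target over a sliding window and keep only the top-k sources; the class, names, and thresholds below are hypothetical.

```python
from collections import deque
import numpy as np

class TopKSelector:
    """Keep the k data sources most correlated with the target, online."""

    def __init__(self, n_sources: int, k: int, window: int = 200):
        self.k = k
        self.buf = [deque(maxlen=window) for _ in range(n_sources)]
        self.y = deque(maxlen=window)

    def update(self, sample: np.ndarray, target: float) -> list:
        """Ingest one measurement vector; return indices of the k best sources."""
        for d, v in zip(self.buf, sample):
            d.append(v)
        self.y.append(target)
        y = np.asarray(self.y)
        scores = [abs(np.corrcoef(np.asarray(d), y)[0, 1]) if len(y) > 2 else 0.0
                  for d in self.buf]
        return sorted(np.argsort(scores)[-self.k:].tolist())

rng = np.random.default_rng(0)
sel = TopKSelector(n_sources=100, k=5)
for _ in range(300):
    x = rng.normal(size=100)
    t = 2 * x[3] - x[17] + rng.normal(0, 0.1)  # only two sources matter
    kept = sel.update(x, t)
print(kept)  # should contain 3 and 17 after a few hundred measurements
```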