Karasu: A Collaborative Approach to Efficient Cluster Configuration for
Big Data Analytics
- URL: http://arxiv.org/abs/2308.11792v2
- Date: Thu, 23 Nov 2023 14:56:37 GMT
- Authors: Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz
Thamsen, Jonathan Will, and Odej Kao
- Abstract summary: Karasu is an approach to more efficient resource configuration profiling.
It promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets.
We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting the right resources for big data analytics jobs is hard because of
the wide variety of configuration options like machine type and cluster size.
As poor choices can have a significant impact on resource efficiency, cost, and
energy usage, automated approaches are gaining popularity. Most existing
methods rely on profiling recurring workloads to find near-optimal solutions
over time. Due to the cold-start problem, this often leads to lengthy and
costly profiling phases. However, big data analytics jobs across users can
share many common properties: they often operate on similar infrastructure,
using similar algorithms implemented in similar frameworks. The potential in
sharing aggregated profiling runs to collaboratively address the cold start
problem is largely unexplored.
We present Karasu, an approach to more efficient resource configuration
profiling that promotes data sharing among users working with similar
infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight
performance models using aggregated runtime information of collaborators and
combines them into an ensemble method to exploit inherent knowledge of the
configuration search space. Moreover, Karasu allows the optimization of
multiple objectives simultaneously. Our evaluation is based on performance data
from diverse workload executions in a public cloud environment. We show that
Karasu is able to significantly boost existing methods in terms of performance,
search time, and cost, even when few comparable profiling runs are available
that share only partial common characteristics with the target job.
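The ensemble idea from the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the profiling data, the nearest-neighbor "models", and the normalization scheme are all assumptions made for the example. Each collaborator contributes a lightweight model trained on its own profiling runs, and an ensemble averages their normalized predictions to rank candidate cluster configurations before the new job is profiled at all.

```python
# Minimal sketch (assumptions, not Karasu's actual code): collaborators
# share lightweight performance models; an ensemble of their normalized
# predictions ranks candidate cluster sizes for a new, unprofiled job.
from statistics import mean

def train_model(runs):
    """Toy 'model': memorize (config -> runtime) pairs from one
    collaborator and predict via the nearest config by cluster size."""
    def predict(config):
        nearest = min(runs, key=lambda r: abs(r["nodes"] - config["nodes"]))
        return nearest["runtime_s"]
    return predict

def ensemble_rank(models, candidates):
    """Score each candidate by the mean of all collaborators' predictions,
    each normalized by that model's worst prediction over the candidates."""
    scores = {}
    for cand in candidates:
        preds = [m(cand) for m in models]
        worst = [max(m(c) for c in candidates) for m in models]
        scores[cand["nodes"]] = mean(p / w for p, w in zip(preds, worst))
    # Lowest normalized predicted runtime first
    return sorted(scores, key=scores.get)

# Hypothetical shared profiling runs from two collaborators
runs_a = [{"nodes": 2, "runtime_s": 900}, {"nodes": 8, "runtime_s": 300}]
runs_b = [{"nodes": 4, "runtime_s": 500}, {"nodes": 16, "runtime_s": 250}]
models = [train_model(runs_a), train_model(runs_b)]

candidates = [{"nodes": n} for n in (2, 4, 8, 16)]
print(ensemble_rank(models, candidates))  # cluster sizes, most promising first
```

In this toy setup, the ranking favors larger clusters because both collaborators observed shorter runtimes there; in practice the paper's models are trained on aggregated runtime information and the search balances multiple objectives (e.g. runtime and cost) rather than a single normalized score.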
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously.
We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z)
- Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
- Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with Application to Fraud Detection [12.076075765740502]
We propose a new framework for efficient sparsity-aware K-means with three characteristics.
First, our framework is divided into a data-independent offline phase and a much faster online phase.
Second, we take advantage of vectorization techniques in both the online and offline phases.
Third, we adopt sparse matrix multiplication for the data sparsity scenario to further improve efficiency.
arXiv Detail & Related papers (2022-08-12T02:58:26Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm is more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Towards General and Efficient Active Learning [20.888364610175987]
Active learning aims to select the most informative samples to exploit limited annotation budgets.
We propose a novel general and efficient active learning (GEAL) method in this paper.
Our method can conduct data selection processes on different datasets with a single-pass inference of the same model.
arXiv Detail & Related papers (2021-12-15T08:35:28Z)
- On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds [0.0]
We propose a collaborative approach for sharing anonymized workload execution traces among users.
We mine them for general patterns and exploit clusters of historical workloads for future optimizations.
arXiv Detail & Related papers (2021-11-16T20:11:36Z)
- Towards Federated Bayesian Network Structure Learning with Continuous Optimization [14.779035801521717]
We present a cross-silo federated learning approach to estimate the structure of a Bayesian network.
We develop a distributed structure learning method based on continuous optimization.
arXiv Detail & Related papers (2021-10-18T14:36:05Z)
- Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation [52.9168275057997]
This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs.
We show that Enel is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
arXiv Detail & Related papers (2021-08-27T10:21:08Z)
- Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts [52.9168275057997]
This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job.
We evaluate our approach on two publicly available datasets consisting of execution data from various dataflow jobs carried out in different environments.
arXiv Detail & Related papers (2021-07-29T11:57:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.