Karasu: A Collaborative Approach to Efficient Cluster Configuration for
Big Data Analytics
- URL: http://arxiv.org/abs/2308.11792v2
- Date: Thu, 23 Nov 2023 14:56:37 GMT
- Authors: Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz
Thamsen, Jonathan Will, and Odej Kao
- Abstract summary: Karasu is an approach to more efficient resource configuration profiling.
It promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets.
We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting the right resources for big data analytics jobs is hard because of
the wide variety of configuration options like machine type and cluster size.
As poor choices can have a significant impact on resource efficiency, cost, and
energy usage, automated approaches are gaining popularity. Most existing
methods rely on profiling recurring workloads to find near-optimal solutions
over time. Due to the cold-start problem, this often leads to lengthy and
costly profiling phases. However, big data analytics jobs across users can
share many common properties: they often operate on similar infrastructure,
using similar algorithms implemented in similar frameworks. The potential in
sharing aggregated profiling runs to collaboratively address the cold start
problem is largely unexplored.
We present Karasu, an approach to more efficient resource configuration
profiling that promotes data sharing among users working with similar
infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight
performance models using aggregated runtime information of collaborators and
combines them into an ensemble method to exploit inherent knowledge of the
configuration search space. Moreover, Karasu allows the optimization of
multiple objectives simultaneously. Our evaluation is based on performance data
from diverse workload executions in a public cloud environment. We show that
Karasu is able to significantly boost existing methods in terms of performance,
search time, and cost, even when few comparable profiling runs are available
that share only partial common characteristics with the target job.
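The ensemble idea from the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the profiling data, the nearest-neighbor "models", and the normalization scheme are all assumptions made for the example. Each collaborator contributes a lightweight model trained on its own profiling runs, and an ensemble averages their normalized predictions to rank candidate cluster configurations before the new job is profiled at all.

```python
# Minimal sketch (assumptions, not Karasu's actual code): collaborators
# share lightweight performance models; an ensemble of their normalized
# predictions ranks candidate cluster sizes for a new, unprofiled job.
from statistics import mean

def train_model(runs):
    """Toy 'model': memorize (config -> runtime) pairs from one
    collaborator and predict via the nearest config by cluster size."""
    def predict(config):
        nearest = min(runs, key=lambda r: abs(r["nodes"] - config["nodes"]))
        return nearest["runtime_s"]
    return predict

def ensemble_rank(models, candidates):
    """Score each candidate by the mean of all collaborators' predictions,
    each normalized by that model's worst prediction over the candidates."""
    scores = {}
    for cand in candidates:
        preds = [m(cand) for m in models]
        worst = [max(m(c) for c in candidates) for m in models]
        scores[cand["nodes"]] = mean(p / w for p, w in zip(preds, worst))
    # Lowest normalized predicted runtime first
    return sorted(scores, key=scores.get)

# Hypothetical shared profiling runs from two collaborators
runs_a = [{"nodes": 2, "runtime_s": 900}, {"nodes": 8, "runtime_s": 300}]
runs_b = [{"nodes": 4, "runtime_s": 500}, {"nodes": 16, "runtime_s": 250}]
models = [train_model(runs_a), train_model(runs_b)]

candidates = [{"nodes": n} for n in (2, 4, 8, 16)]
print(ensemble_rank(models, candidates))  # cluster sizes, most promising first
```

In this toy setup, the ranking favors larger clusters because both collaborators observed shorter runtimes there; in practice the paper's models are trained on aggregated runtime information and the search balances multiple objectives (e.g. runtime and cost) rather than a single normalized score.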
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously.
We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z)
- Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
- Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with Application to Fraud Detection [12.076075765740502]
We propose a new framework for efficient sparsity-aware K-means with three characteristics.
First, our framework is divided into a data-independent offline phase and a much faster online phase.
Second, we take advantage of vectorization techniques in both the online and offline phases.
Third, we adopt sparse matrix multiplication for the data sparsity scenario to further improve efficiency.
arXiv Detail & Related papers (2022-08-12T02:58:26Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm is more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Towards General and Efficient Active Learning [20.888364610175987]
Active learning aims to select the most informative samples to exploit limited annotation budgets.
We propose a novel general and efficient active learning (GEAL) method in this paper.
Our method can conduct data selection processes on different datasets with a single-pass inference of the same model.
arXiv Detail & Related papers (2021-12-15T08:35:28Z)
- On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds [0.0]
We propose a collaborative approach for sharing anonymized workload execution traces among users.
We mine them for general patterns and exploit clusters of historical workloads for future optimizations.
arXiv Detail & Related papers (2021-11-16T20:11:36Z)
- Towards Federated Bayesian Network Structure Learning with Continuous Optimization [14.779035801521717]
We present a cross-silo federated learning approach to estimate the structure of a Bayesian network.
We develop a distributed structure learning method based on continuous optimization.
arXiv Detail & Related papers (2021-10-18T14:36:05Z)
- Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation [52.9168275057997]
This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs.
We show that Enel is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
arXiv Detail & Related papers (2021-08-27T10:21:08Z)
- Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts [52.9168275057997]
This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job.
We evaluate our approach on two publicly available datasets consisting of execution data from various dataflow jobs carried out in different environments.
arXiv Detail & Related papers (2021-07-29T11:57:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.