Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across
Contexts
- URL: http://arxiv.org/abs/2107.13921v1
- Date: Thu, 29 Jul 2021 11:57:38 GMT
- Title: Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across
Contexts
- Authors: Dominik Scheinert, Lauritz Thamsen, Houkun Zhu, Jonathan Will,
Alexander Acker, Thorsten Wittkopp, Odej Kao
- Abstract summary: This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job.
We evaluate our approach on two publicly available datasets consisting of execution data from various dataflow jobs carried out in different environments.
- Score: 52.9168275057997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed dataflow systems enable the use of clusters for scalable data
analytics. However, selecting appropriate cluster resources for a processing
job is often not straightforward. Performance models trained on historical
executions of a concrete job are helpful in such situations, yet they are
usually bound to a specific job execution context (e.g. node type, software
versions, job parameters) due to the few considered input parameters. Even in
case of slight context changes, such supportive models need to be retrained and
cannot benefit from historical execution data from related contexts.
This paper presents Bellamy, a novel modeling approach that combines
scale-outs, dataset sizes, and runtimes with additional descriptive properties
of a dataflow job. It is thereby able to capture the context of a job
execution. Moreover, Bellamy realizes a two-step modeling approach. First,
a general model is trained on all the available data for a specific scalable
analytics algorithm, thereby incorporating data from different contexts.
Subsequently, the general model is optimized for the specific situation at
hand, based on the available data for the concrete context. We evaluate our
approach on two publicly available datasets consisting of execution data from
various dataflow jobs carried out in different environments, showing that
Bellamy outperforms state-of-the-art methods.
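The two-step scheme described in the abstract, first pretraining a general model on pooled execution data from many contexts and then optimizing it for the concrete context at hand, can be sketched as follows. This is a hedged illustration only: the simple linear runtime model, the toy data, and the names (`train`, `predict`, `w_general`, `w_ctx`) are hypothetical stand-ins and do not reproduce Bellamy's actual neural architecture or its encoding of descriptive job properties.

```python
import numpy as np

def train(X, y, w=None, lr=0.01, epochs=500):
    """Fit a linear runtime model by gradient descent.

    If `w` is given, training continues from those weights, which is
    how the context-specific fine-tuning step reuses the general model.
    """
    X = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

def predict(X, w):
    X = np.hstack([X, np.ones((len(X), 1))])
    return X @ w

rng = np.random.default_rng(0)

# Step 1: general model trained on executions pooled from several
# contexts (features: scale-out, dataset size; toy data, not real traces).
X_all = rng.uniform(1, 10, size=(200, 2))
y_all = 100 / X_all[:, 0] * X_all[:, 1] + rng.normal(0, 1, 200)
w_general = train(X_all, y_all)

# Step 2: fine-tune the general weights on the few runs observed
# in the concrete target context (here: a slightly different workload).
X_ctx = rng.uniform(1, 10, size=(10, 2))
y_ctx = 120 / X_ctx[:, 0] * X_ctx[:, 1]
w_ctx = train(X_ctx, y_ctx, w=w_general.copy(), epochs=200)
```

Starting the context-specific optimization from the pooled weights mirrors the paper's premise that related contexts share structure, so the fine-tuned model can get by with only a handful of context-specific runs instead of retraining from scratch.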
Related papers
- Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z)
- UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- A Case for Dataset Specific Profiling [0.9023847175654603]
Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets.
With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications.
For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources.
arXiv Detail & Related papers (2022-08-01T18:38:05Z)
- On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds [0.0]
We propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations.
arXiv Detail & Related papers (2021-11-16T20:11:36Z)
- Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation [52.9168275057997]
This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs.
We show that Enel is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
arXiv Detail & Related papers (2021-08-27T10:21:08Z)
- Spectral goodness-of-fit tests for complete and partial network data [1.7188280334580197]
We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data.
We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters.
Our method leads to improved community detection algorithms.
arXiv Detail & Related papers (2021-06-17T17:56:30Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.