Efficient Deep Learning Pipelines for Accurate Cost Estimations Over
Large Scale Query Workload
- URL: http://arxiv.org/abs/2103.12465v1
- Date: Tue, 23 Mar 2021 11:36:10 GMT
- Title: Efficient Deep Learning Pipelines for Accurate Cost Estimations Over
Large Scale Query Workload
- Authors: Johan Kok Zhi Kang, Gaurav, Sien Yi Tan, Feng Cheng, Shixuan Sun,
Bingsheng He
- Abstract summary: We develop a tree convolution based data science pipeline that accurately predicts resource consumption patterns of query traces.
We evaluate our pipeline over 19K Presto OLAP queries from Grab, on a data lake of more than 20PB of data.
We demonstrate direct cost savings of up to 13.2x for large batched model training over Microsoft Azure.
- Score: 25.52190205651031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of deep learning models to forecast the resource consumption
patterns of SQL queries has recently become a popular area of study. With many
companies using cloud platforms to power their data lakes for large-scale
analytic demands, these models form a critical part of the pipeline for
managing cloud resource provisioning. While these models have demonstrated
promising accuracy, training them over large-scale industry workloads is
expensive. Space-inefficient encoding techniques over large numbers of queries,
together with the excessive padding used to enforce shape consistency across
diverse query plans, imply 1) longer model training time and 2) the need for
expensive, scaled-up infrastructure to support batched training. To address
this, we developed Prestroid, a tree convolution based data science pipeline
that accurately predicts the resource consumption patterns of query traces at
a much lower cost.
We evaluated our pipeline over 19K Presto OLAP queries from Grab, on a data
lake of more than 20PB of data. Experimental results show that our pipeline
outperforms benchmark approaches on predictive accuracy, contributing to more
precise resource prediction for large-scale workloads, while also reducing
per-batch memory footprint by 13.5x and per-epoch training time by 3.45x. We
demonstrate direct cost savings of up to 13.2x for large batched model
training over Microsoft Azure VMs.
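The abstract's core technique, tree convolution over query-plan trees, can be
sketched compactly. The following is a minimal, illustrative PyTorch sketch,
not Prestroid's actual implementation: the feature encoding, layer sizes, and
all class and function names (TreeConv, CostModel) are assumptions. Note how
index-based child lookups let plans of different shapes share one code path,
one way to sidestep the per-batch padding overhead the abstract criticizes.

```python
# Minimal sketch of tree convolution over a binarized query-plan tree.
# Illustrative assumptions throughout; not the paper's implementation.
import torch
import torch.nn as nn

class TreeConv(nn.Module):
    """One tree-convolution layer: each node is combined with its left
    and right children through separate weight matrices."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_left = nn.Linear(in_dim, out_dim, bias=False)
        self.w_right = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, left, right):
        # feats: [N+1, D] node features; row 0 is a zero "null" node.
        # left/right: [N] child indices into feats (0 = missing child).
        out = torch.relu(self.w_self(feats[1:])
                         + self.w_left(feats[left])
                         + self.w_right(feats[right]))
        # Re-attach the null row so stacked layers index identically.
        return torch.cat([torch.zeros(1, out.shape[1]), out], dim=0)

class CostModel(nn.Module):
    """Tree conv -> dynamic max-pool -> MLP regression head."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = TreeConv(in_dim, hidden)
        self.conv2 = TreeConv(hidden, hidden)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, feats, left, right):
        h = self.conv2(self.conv1(feats, left, right), left, right)
        pooled = h[1:].max(dim=0).values  # pool over all plan nodes
        return self.head(pooled)          # e.g. predicted CPU seconds

# Toy plan: a join (node 1) over two scans (nodes 2 and 3), 8-dim features.
feats = torch.cat([torch.zeros(1, 8), torch.randn(3, 8)])
left, right = torch.tensor([2, 0, 0]), torch.tensor([3, 0, 0])
print(CostModel(8)(feats, left, right))
```

Because child relationships travel in index tensors rather than in padding,
trees of different sizes need no common shape, which is consistent with the
memory-footprint reduction the abstract reports.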
Related papers
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Pushing the Limits of Pre-training for Time Series Forecasting in the CloudOps Domain [54.67888148566323]
We introduce three large-scale time series forecasting datasets from the cloud operations domain.
We show that the pre-trained method is a strong zero-shot baseline and benefits from further scaling, both in model and dataset size.
Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method.
arXiv Detail & Related papers (2023-10-08T08:09:51Z)
- Pre-training on Synthetic Driving Data for Trajectory Prediction [61.520225216107306]
We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting.
We adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them.
We conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies.
arXiv Detail & Related papers (2023-09-18T19:49:22Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- Training Large Language Models Efficiently with Sparsity and Dataflow [3.1780195670658378]
This paper demonstrates an end-to-end training flow for a large language model, a 13-billion-parameter GPT, using sparsity and dataflow.
We show that we can train GPT 13B to the same quality as the dense GPT 13B model, while achieving an end-to-end speedup of 4.5x over the dense A100 baseline.
arXiv Detail & Related papers (2023-04-11T21:37:13Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script-language engines do not by themselves supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all such requirements while using only those basic engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC, and Spark require far more computational resources, at higher cost, than conventional online services.
We propose a predictive autoscaler that provides an elastic interface for customers and overprovisions instances ahead of demand (see the sketch after this list).
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
- Importance of Data Loading Pipeline in Training Deep Neural Networks [2.127049691404299]
In large models, data loading takes a significant portion of overall training time.
We compare a binary data format to accelerate data reading, and NVIDIA DALI to accelerate data augmentation, as sketched after this list.
Our study shows improvement on the order of 20% to 40% if such dedicated tools are used.
arXiv Detail & Related papers (2020-04-21T14:19:48Z)
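To make the binary-format idea from the last entry concrete, here is a minimal
sketch contrasting per-sample file reads with a single memory-mapped blob. It
assumes NumPy .npy storage; the file layout, sample shapes, and names are
illustrative assumptions, not code from any of the papers above.

```python
# Illustrative only: one small file per sample vs. one memory-mapped
# binary blob. Layout and names are assumptions for this sketch.
import numpy as np
import os, tempfile, time

samples = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(256)]
tmp = tempfile.mkdtemp()

# Option A: one .npy file per sample (many open/parse round trips).
for i, s in enumerate(samples):
    np.save(os.path.join(tmp, f"sample_{i}.npy"), s)

# Option B: all samples packed into a single binary file.
blob_path = os.path.join(tmp, "dataset.npy")
np.save(blob_path, np.stack(samples))

t0 = time.perf_counter()
a = [np.load(os.path.join(tmp, f"sample_{i}.npy")) for i in range(256)]
t1 = time.perf_counter()
blob = np.load(blob_path, mmap_mode="r")  # lazily mapped, no full read
b = [np.asarray(blob[i]) for i in range(256)]
t2 = time.perf_counter()

print(f"per-file reads: {t1 - t0:.4f}s   memmapped blob: {t2 - t1:.4f}s")
```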
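Similarly, the predictive-autoscaler entry above reduces to a
forecast-then-overprovision loop. The sketch below assumes a simple
linear-trend regressor and an arbitrary headroom factor; that paper's actual
model and scaling policy are not reproduced here.

```python
# Toy predictive autoscaler: fit a trend to recent demand, forecast the
# next window, and overprovision by a headroom factor. The linear model,
# horizon, and headroom are illustrative assumptions.
import numpy as np

def plan_capacity(demand_history, horizon=5, headroom=1.2):
    """Return the instance count to provision `horizon` steps ahead."""
    t = np.arange(len(demand_history))
    slope, intercept = np.polyfit(t, demand_history, deg=1)
    forecast = slope * (len(demand_history) - 1 + horizon) + intercept
    return max(1, int(np.ceil(headroom * forecast)))

# Toy trace: batch-job demand ramping up over twelve intervals.
history = [4, 5, 5, 6, 8, 9, 9, 11, 12, 14, 15, 17]
print(plan_capacity(history))  # provisions ahead of the observed ramp
```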