Building a Performance Model for Deep Learning Recommendation Model
Training on GPUs
- URL: http://arxiv.org/abs/2201.07821v1
- Date: Wed, 19 Jan 2022 19:05:42 GMT
- Title: Building a Performance Model for Deep Learning Recommendation Model
Training on GPUs
- Authors: Zhongyi Lin and Louis Feng and Ehsan K. Ardestani and Jaewon Lee and
John Lundell and Changkyu Kim and Arun Kejariwal and John D. Owens
- Abstract summary: We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM)
We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time.
We propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph.
- Score: 6.05245376098191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We devise a performance model for GPU training of Deep Learning
Recommendation Models (DLRM), whose GPU utilization is low compared to other
well-optimized CV and NLP models. We show that both the device active time (the
sum of kernel runtimes) and the device idle time are important components of
the overall device time. We therefore tackle them separately by (1) flexibly
adopting heuristic-based and ML-based kernel performance models for operators
that dominate the device active time, and (2) categorizing operator overheads
into five types to determine quantitatively their contribution to the device
active time. Combining these two parts, we propose a critical-path-based
algorithm to predict the per-batch training time of DLRM by traversing its
execution graph. We achieve less than 10% geometric mean average error (GMAE)
in all kernel performance modeling, and 5.23% and 7.96% geomean errors for GPU
active time and overall end-to-end per-batch training time prediction,
respectively. We show that our general performance model not only achieves low
prediction error on DLRM, which has highly customized configurations and is
dominated by multiple factors, but also yields comparable accuracy on other
compute-bound ML models targeted by most previous methods. Using this
performance model and graph-level data and task dependency analyses, we show
our system can provide more general model-system co-design than previous
methods.
Related papers
- Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [4.959530958049395]
We develop a pipeline to Characterize and predict the training performance of modern machine learning (ML) workloads on compute systems.
Our pipeline generalizes to other types of ML workloads, such as Transformer-based NLP models.
It is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration.
arXiv Detail & Related papers (2024-04-19T07:20:33Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Benchmarking Data Efficiency and Computational Efficiency of Temporal
Action Localization Models [42.06124795143787]
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end.
This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power.
arXiv Detail & Related papers (2023-08-24T20:59:55Z) - Structured Cooperative Learning with Graphical Model Priors [98.53322192624594]
We study how to train personalized models for different tasks on decentralized devices with limited local data.
We propose "Structured Cooperative Learning (SCooL)", in which a cooperation graph across devices is generated by a graphical model.
We evaluate SCooL and compare it with existing decentralized learning methods on an extensive set of benchmarks.
arXiv Detail & Related papers (2023-06-16T02:41:31Z) - EAMDrift: An interpretable self retrain model for time series [0.0]
We present EAMDrift, a novel method that combines forecasts from multiple individual predictors by weighting each prediction according to a performance metric.
EAMDrift is designed to automatically adapt to out-of-distribution patterns in data and identify the most appropriate models to use at each moment.
Our study on real-world datasets shows that EAMDrift outperforms individual baseline models by 20% and achieves comparable accuracy results to non-interpretable ensemble models.
arXiv Detail & Related papers (2023-05-31T13:25:26Z) - Towards a learning-based performance modeling for accelerating Deep
Neural Networks [1.1549572298362785]
We start an investigation of predictive models based on machine learning techniques in order to optimize Convolution Neural Networks (CNNs)
Preliminary experiments on Midgard-based ARM Mali GPU show that our predictive model outperforms all the convolution operators manually selected by the library.
arXiv Detail & Related papers (2022-12-09T18:28:07Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language
Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - Using Graph Neural Networks to model the performance of Deep Neural
Networks [2.1151356984322307]
We develop a novel performance model that adopts a graph representation.
Experimental evaluation shows a 7:75x and 12x reduction in prediction error compared to the Halide and TVM models, respectively.
arXiv Detail & Related papers (2021-08-27T20:20:17Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance-art in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.