Understanding Training Efficiency of Deep Learning Recommendation Models at Scale
- URL: http://arxiv.org/abs/2011.05497v1
- Date: Wed, 11 Nov 2020 01:21:43 GMT
- Title: Understanding Training Efficiency of Deep Learning Recommendation Models at Scale
- Authors: Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, Kim Hazelwood
- Abstract summary: This paper explains the intricacies of using GPUs for training recommendation models, the factors affecting hardware efficiency at scale, and learnings from a new scale-up GPU server design, Zion.
- Score: 8.731263641794897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of GPUs has proliferated for machine learning workflows and is now
considered mainstream for many deep learning models. Meanwhile, when training
state-of-the-art personal recommendation models, which consume the highest
number of compute cycles at our large-scale datacenters, the use of GPUs came
with various challenges because these models have both compute-intensive and
memory-intensive components. The GPU performance and efficiency of these
recommendation models are largely affected by model architecture configurations
such as the dense and sparse features and the MLP dimensions. Furthermore, these models
often contain large embedding tables that do not fit into limited GPU memory.
The goal of this paper is to explain the intricacies of using GPUs for training
recommendation models, factors affecting hardware efficiency at scale, and
learnings from a new scale-up GPU server design, Zion.
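To make the compute-versus-memory split concrete, the following is a minimal DLRM-style sketch in PyTorch; the module names, table sizes, and dimensions are illustrative assumptions, not the configurations studied in the paper.

```python
# Minimal sketch (illustrative only, not the paper's model): a recommendation
# model with a compute-intensive MLP over dense features and memory-intensive
# embedding tables over sparse (categorical) features.
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    def __init__(self, num_dense=13, table_sizes=(100_000, 50_000), dim=16):
        super().__init__()
        # Memory-intensive component: one embedding table per sparse feature.
        # Production tables can have billions of rows and exceed GPU memory.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(rows, dim, mode="sum") for rows in table_sizes
        )
        # Compute-intensive component: bottom MLP over the dense features.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, dim), nn.ReLU()
        )
        # Top MLP over the concatenated dense and sparse representations.
        self.top_mlp = nn.Sequential(
            nn.Linear(dim * (1 + len(table_sizes)), 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse_ids):
        # dense: (batch, num_dense); sparse_ids: one (batch,) id tensor per table
        dense_out = self.bottom_mlp(dense)
        sparse_out = [t(ids.unsqueeze(1)) for t, ids in zip(self.tables, sparse_ids)]
        x = torch.cat([dense_out] + sparse_out, dim=1)
        return torch.sigmoid(self.top_mlp(x))
```

Even in this toy model the embedding tables dominate the parameter count, which mirrors why production-scale tables are kept in CPU memory or sharded across devices rather than fit on a single GPU.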
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning achieves up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- Forecasting GPU Performance for Deep Learning Training and Inference [10.741682409837612]
NeuSight is a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution.
NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU.
It reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of the GPT-3 model for training and inference on the H100, compared to state-of-the-art prior works.
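A rough, assumption-laden illustration of the tile idea (the function and the wave-based aggregation below are ours, not NeuSight's actual model):

```python
# Hypothetical sketch: estimate kernel latency by splitting the work into
# tiles and assuming the GPU executes them in waves, one tile per SM.
import math

def predict_kernel_latency(num_tiles: int, per_tile_seconds: float, num_sms: int) -> float:
    waves = math.ceil(num_tiles / num_sms)  # rounds of SM-parallel tile execution
    return waves * per_tile_seconds

# Illustrative numbers only: 4096 tiles, 5 microseconds per tile, 108 SMs.
print(predict_kernel_latency(4096, 5e-6, 108))
```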
arXiv Detail & Related papers (2024-07-18T18:47:52Z)
- The Case for Co-Designing Model Architectures with Hardware [13.022505733049597]
We provide a set of guidelines for users to maximize the runtime performance of their transformer models.
We find the throughput of models with efficient model shapes is up to 39% higher.
arXiv Detail & Related papers (2024-01-25T19:50:31Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Survey on Large Scale Neural Network Training [48.424512364338746]
Modern Deep Neural Networks (DNNs) require significant memory to store weights, activations, and other intermediate tensors during training.
This survey provides a systematic overview of the approaches that enable more efficient DNN training.
arXiv Detail & Related papers (2022-02-21T18:48:02Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
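A short sketch of the top-K magnitude masking idea (illustrative only; the actual Top-KAST method also maintains a larger backward set, which is omitted here):

```python
# Sketch of constant-sparsity masking: keep only the top-K largest-magnitude
# weights active in the forward pass. Not the authors' implementation.
import torch

def topk_mask(weight: torch.Tensor, density: float = 0.1) -> torch.Tensor:
    """Return a 0/1 mask keeping the `density` fraction of largest-magnitude weights."""
    k = max(1, int(density * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).to(weight.dtype)

w = torch.randn(256, 256, requires_grad=True)
mask = topk_mask(w.detach(), density=0.1)
sparse_w = w * mask  # the forward pass sees a constant-sparsity weight tensor
```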
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
- High-Performance Training by Exploiting Hot-Embeddings in Recommendation Systems [2.708848417398231]
Recommendation models are widely used learning models that suggest relevant items to users in e-commerce and online-advertising applications.
These models use massive embedding tables to store numerical representations of items' and users' categorical variables.
Due to the conflicting compute and memory requirements of the dense and embedding components, the training process for recommendation models is divided across CPU and GPU.
This paper leverages skewed embedding-table accesses to use GPU resources efficiently during training.
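A minimal sketch of the hot/cold split such a system might use (the class, its caching policy, and all names are hypothetical assumptions, not the paper's implementation):

```python
# Hypothetical sketch: cache frequently accessed ("hot") embedding rows on the
# GPU while the full table stays in CPU memory; assumes a CUDA device.
import torch

class HotColdEmbedding:
    def __init__(self, full_table: torch.Tensor, hot_ids: torch.Tensor, device="cuda"):
        self.cold = full_table                     # full table in CPU memory
        self.hot = full_table[hot_ids].to(device)  # hot rows cached on the GPU
        # map original row id -> slot in the hot cache (-1 means not cached)
        self.slot = torch.full((full_table.size(0),), -1, dtype=torch.long, device=device)
        self.slot[hot_ids.to(device)] = torch.arange(hot_ids.numel(), device=device)

    def lookup(self, ids: torch.Tensor) -> torch.Tensor:
        ids = ids.to(self.slot.device)
        slots = self.slot[ids]
        hit = slots >= 0
        out = torch.empty(ids.numel(), self.cold.size(1), device=ids.device)
        out[hit] = self.hot[slots[hit]]                        # served from GPU cache
        out[~hit] = self.cold[ids[~hit].cpu()].to(ids.device)  # fetched from CPU memory
        return out
```

Under skewed access patterns, most lookups hit the GPU cache, so the slower CPU path is taken only for rarely accessed rows.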
arXiv Detail & Related papers (2021-03-01T01:43:26Z)