High-Performance Training by Exploiting Hot-Embeddings in Recommendation
Systems
- URL: http://arxiv.org/abs/2103.00686v2
- Date: Tue, 2 Mar 2021 19:16:36 GMT
- Title: High-Performance Training by Exploiting Hot-Embeddings in Recommendation
Systems
- Authors: Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant
J. Nair
- Abstract summary: Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications.
These models use massive embedding tables to store numerical representations of items' and users' categorical variables.
Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU.
This paper leverages skewed embedding table accesses to use GPU resources efficiently during training.
- Score: 2.708848417398231
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recommendation models are commonly used learning models that suggest relevant
items to a user for e-commerce and online advertisement-based applications.
Current recommendation models include deep-learning-based (DLRM) and time-based
sequence (TBSM) models. These models use massive embedding tables to store a
numerical representation of items' and users' categorical variables
(memory-bound) while also using neural networks to generate outputs
(compute-bound). Due to these conflicting compute and memory requirements, the
training process for recommendation models is divided across CPU and GPU for
embedding and neural network executions, respectively. Such a training process
naively assigns the same level of importance to each embedding entry. This
paper observes that some training inputs and their accesses into the embedding
tables are heavily skewed with certain entries being accessed up to 10000x
more. This paper leverages these skewed embedding table accesses to
efficiently use the GPU resources during training. To this end, this paper
proposes a Frequently Accessed Embeddings (FAE) framework that exposes a
dynamic knob to the software based on the GPU memory capacity and the input
popularity index. This framework efficiently estimates and varies the size of
the hot portions of the embedding tables within GPUs and reallocates the rest
of the embeddings on the CPU. Overall, our framework speeds up the training of
the recommendation models on Kaggle, Terabyte, and Alibaba datasets by 2.34x as
compared to a baseline that uses Intel-Xeon CPUs and Nvidia Tesla-V100 GPUs,
while maintaining accuracy.
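The hot/cold split at the heart of this idea admits a compact illustration. Below is a minimal sketch, assuming that access counts over a sample of training inputs stand in for the paper's popularity index; the function and parameter names (partition_hot_embeddings, gpu_budget_bytes) are hypothetical and not the authors' FAE API.

```python
import numpy as np

def partition_hot_embeddings(index_samples, num_rows, emb_dim,
                             gpu_budget_bytes, dtype_bytes=4):
    """Illustrative hot/cold split for one embedding table.

    index_samples: row indices observed in a sample of training inputs
    (access counts approximate the popularity index). Returns the row
    ids to keep on the GPU (hot) and those left in CPU memory (cold).
    """
    # Count how often each embedding row is accessed in the sample.
    counts = np.bincount(index_samples, minlength=num_rows)

    # Rank rows from most to least frequently accessed.
    ranked = np.argsort(-counts)

    # The "knob": how many rows fit in the available GPU memory budget.
    bytes_per_row = emb_dim * dtype_bytes
    max_hot_rows = int(gpu_budget_bytes // bytes_per_row)

    return ranked[:max_hot_rows], ranked[max_hot_rows:]

# Example: with a 1M-row table of 64-dim float32 embeddings and a 64 MB
# GPU budget, roughly the 256K most frequently accessed rows stay on GPU.
rng = np.random.default_rng(0)
samples = rng.zipf(1.2, size=100_000) % 1_000_000  # heavily skewed accesses
hot, cold = partition_hot_embeddings(samples, 1_000_000, 64, 64 * 2**20)
```

In the full framework, this knob is varied dynamically with the GPU memory capacity and the input popularity index, and the remaining (cold) embeddings are reallocated to the CPU.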
Related papers
- In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamics computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
- A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across CPU and GPU memory.
Our proposed software cache is efficient in training entire DLRMs on GPU in a synchronized update manner.
arXiv Detail & Related papers (2022-08-08T12:08:05Z)
- Heterogeneous Acceleration Pipeline for Recommendation System Training [1.8457649813040096]
Recommendation models rely on deep learning networks and large embedding tables.
These models are typically trained using hybrid-GPU or GPU-only configurations.
This paper introduces Hotline, a heterogeneous CPU acceleration pipeline.
arXiv Detail & Related papers (2022-04-11T23:10:41Z)
- Survey on Large Scale Neural Network Training [48.424512364338746]
Modern Deep Neural Networks (DNNs) require significant memory to store weights, activations, and other intermediate tensors during training.
This survey provides a systematic overview of the approaches that enable more efficient DNNs training.
arXiv Detail & Related papers (2022-02-21T18:48:02Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [23.264897780201316]
Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train the deep CTR models on huge volume of training data efficiently.
We propose the ScaleFreeCTR: a MixCache-based distributed training system for CTR models.
arXiv Detail & Related papers (2021-04-17T13:36:19Z)
- Understanding Training Efficiency of Deep Learning Recommendation Models at Scale [8.731263641794897]
This paper explains the intricacies of using GPUs for training recommendation models, the factors affecting hardware efficiency at scale, and learnings from a new scale-up GPU server design, Zion.
arXiv Detail & Related papers (2020-11-11T01:21:43Z)