High-Performance Training by Exploiting Hot-Embeddings in Recommendation
Systems
- URL: http://arxiv.org/abs/2103.00686v2
- Date: Tue, 2 Mar 2021 19:16:36 GMT
- Title: High-Performance Training by Exploiting Hot-Embeddings in Recommendation
Systems
- Authors: Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant
J. Nair
- Abstract summary: Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications.
These models use massive embedding tables to store numerical representations of items' and users' categorical variables.
Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU.
This paper leverages skewed embedding table accesses to use GPU resources efficiently during training.
- Score: 2.708848417398231
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recommendation models are commonly used learning models that suggest relevant
items to a user for e-commerce and online advertisement-based applications.
Current recommendation models include deep-learning-based (DLRM) and time-based
sequence (TBSM) models. These models use massive embedding tables to store a
numerical representation of items' and users' categorical variables
(memory-bound) while also using neural networks to generate outputs
(compute-bound). Due to these conflicting compute and memory requirements, the
training process for recommendation models is divided across CPU and GPU for
embedding and neural network executions, respectively. Such a training process
naively assigns the same level of importance to each embedding entry. This
paper observes that some training inputs and their accesses into the embedding
tables are heavily skewed with certain entries being accessed up to 10000x
more. This paper leverages these skewed embedding table accesses to
efficiently use the GPU resources during training. To this end, this paper
proposes a Frequently Accessed Embeddings (FAE) framework that exposes a
dynamic knob to the software based on the GPU memory capacity and the input
popularity index. This framework efficiently estimates and varies the size of
the hot portions of the embedding tables within GPUs and reallocates the rest
of the embeddings on the CPU. Overall, our framework speeds up the training of
the recommendation models on Kaggle, Terabyte, and Alibaba datasets by 2.34x as
compared to a baseline that uses Intel-Xeon CPUs and Nvidia Tesla-V100 GPUs,
while maintaining accuracy.
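The hot/cold split at the heart of this idea admits a compact illustration. Below is a minimal sketch, assuming that access counts over a sample of training inputs stand in for the paper's popularity index; the function and parameter names (partition_hot_embeddings, gpu_budget_bytes) are hypothetical and not the authors' FAE API.

```python
import numpy as np

def partition_hot_embeddings(index_samples, num_rows, emb_dim,
                             gpu_budget_bytes, dtype_bytes=4):
    """Illustrative hot/cold split for one embedding table.

    index_samples: row indices observed in a sample of training inputs
    (access counts approximate the popularity index). Returns the row
    ids to keep on the GPU (hot) and those left in CPU memory (cold).
    """
    # Count how often each embedding row is accessed in the sample.
    counts = np.bincount(index_samples, minlength=num_rows)

    # Rank rows from most to least frequently accessed.
    ranked = np.argsort(-counts)

    # The "knob": how many rows fit in the available GPU memory budget.
    bytes_per_row = emb_dim * dtype_bytes
    max_hot_rows = int(gpu_budget_bytes // bytes_per_row)

    return ranked[:max_hot_rows], ranked[max_hot_rows:]

# Example: with a 1M-row table of 64-dim float32 embeddings and a 64 MB
# GPU budget, roughly the 256K most frequently accessed rows stay on GPU.
rng = np.random.default_rng(0)
samples = rng.zipf(1.2, size=100_000) % 1_000_000  # heavily skewed accesses
hot, cold = partition_hot_embeddings(samples, 1_000_000, 64, 64 * 2**20)
```

In the full framework, this knob is varied dynamically with the GPU memory capacity and the input popularity index, and the remaining (cold) embeddings are reallocated to the CPU.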
Related papers
- In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamics computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
- A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across CPU and GPU memory.
Our proposed software cache is efficient in training entire DLRMs on GPU in a synchronized update manner.
arXiv Detail & Related papers (2022-08-08T12:08:05Z)
- Heterogeneous Acceleration Pipeline for Recommendation System Training [1.8457649813040096]
Recommendation models rely on deep learning networks and large embedding tables.
These models are typically trained using hybrid-GPU or GPU-only configurations.
This paper introduces Hotline, a heterogeneous CPU acceleration pipeline.
arXiv Detail & Related papers (2022-04-11T23:10:41Z)
- Survey on Large Scale Neural Network Training [48.424512364338746]
Modern Deep Neural Networks (DNNs) require significant memory to store weights, activations, and other intermediate tensors during training.
This survey provides a systematic overview of the approaches that enable more efficient DNNs training.
arXiv Detail & Related papers (2022-02-21T18:48:02Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [23.264897780201316]
Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train the deep CTR models on huge volume of training data efficiently.
We propose the ScaleFreeCTR: a MixCache-based distributed training system for CTR models.
arXiv Detail & Related papers (2021-04-17T13:36:19Z)
- Understanding Training Efficiency of Deep Learning Recommendation Models at Scale [8.731263641794897]
This paper explains the intricacies of using GPUs for training recommendation models, the factors affecting hardware efficiency at scale, and learnings from a new scale-up GPU server design, Zion.
arXiv Detail & Related papers (2020-11-11T01:21:43Z)