A Frequency-aware Software Cache for Large Recommendation System
Embeddings
- URL: http://arxiv.org/abs/2208.05321v1
- Date: Mon, 8 Aug 2022 12:08:05 GMT
- Title: A Frequency-aware Software Cache for Large Recommendation System
Embeddings
- Authors: Jiarui Fang and Geng Zhang and Jiatong Han and Shenggui Li and Zhengda
Bian and Yongbin Li and Jin Liu and Yang You
- Abstract summary: Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table in the CPU and GPU memory space.
Our proposed software cache is efficient in training entire DLRMs on GPU in a synchronized update manner.
- Score: 11.873521953539361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models (DLRMs) have been widely applied in
Internet companies. The embedding tables of DLRMs are too large to fit entirely
in GPU memory. We propose a GPU-based software cache approach that
dynamically manages the embedding table across CPU and GPU memory by
leveraging the ID frequency statistics of the target dataset. Our proposed
software cache efficiently trains entire DLRMs on GPU in a synchronized
update manner. It also scales to multiple GPUs in combination with
widely used hybrid parallel training approaches. Evaluating our prototype
system shows that we can keep only 1.5% of the embedding parameters in the GPU
to obtain a decent end-to-end training speed.
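The caching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `FrequencyAwareCache`, its methods, and the eviction rule (evict the resident row with the lowest dataset frequency) are assumptions made for illustration, and plain Python lists stand in for CPU and GPU memory.

```python
from collections import Counter


class FrequencyAwareCache:
    """Toy two-tier embedding store (hypothetical, for illustration only).

    `cpu_table` plays the role of the full table in CPU memory;
    `self.cache` plays the role of the small GPU-resident cache.
    """

    def __init__(self, cpu_table, id_counts, cache_rows):
        self.cpu_table = cpu_table        # full table: one row (list of floats) per id
        self.freq = Counter(id_counts)    # id -> occurrence count in the dataset
        self.cache_rows = cache_rows      # capacity of the fast tier, in rows
        self.cache = {}                   # id -> row resident in the fast tier
        self.hits = 0
        self.misses = 0
        # Warm-up: preload the most frequently occurring ids.
        for idx, _ in self.freq.most_common(cache_rows):
            self.cache[idx] = list(cpu_table[idx])

    def _evict_one(self, protected):
        # Assumed policy: evict the least frequent resident id that the
        # current batch does not need, writing its row back to the CPU table.
        victim = min(
            (i for i in self.cache if i not in protected),
            key=lambda i: self.freq[i],
        )
        self.cpu_table[victim] = self.cache.pop(victim)

    def lookup(self, batch_ids):
        unique = list(dict.fromkeys(batch_ids))  # de-duplicate, keep order
        if len(unique) > self.cache_rows:
            raise ValueError("batch has more unique ids than cache rows")
        needed = set(unique)
        for idx in unique:
            if idx in self.cache:
                self.hits += 1
                continue
            self.misses += 1
            if len(self.cache) >= self.cache_rows:
                self._evict_one(needed)
            self.cache[idx] = list(self.cpu_table[idx])
        # Rows are returned by reference, so in-place updates land in the cache.
        return [self.cache[idx] for idx in batch_ids]


# Usage: 8 ids with a skewed frequency distribution, 3 cache rows.
table = [[float(i)] * 2 for i in range(8)]
counts = {0: 50, 1: 30, 2: 10, 3: 5, 4: 2, 5: 1, 6: 1, 7: 1}
cache = FrequencyAwareCache(table, counts, cache_rows=3)
rows = cache.lookup([0, 1, 5, 0])
print(cache.hits, cache.misses)
```

With a skewed id distribution, most lookups hit the preloaded rows, which is the intuition behind keeping only a small fraction of the table in the fast tier.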
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z) - An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z) - Heterogeneous Acceleration Pipeline for Recommendation System Training [1.8457649813040096]
Recommendation models rely on deep learning networks and large embedding tables.
These models are typically trained using hybrid-GPU or GPU-only configurations.
This paper introduces Hotline, a heterogeneous CPU acceleration pipeline.
arXiv Detail & Related papers (2022-04-11T23:10:41Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception [91.24236600199542]
ASH is a modern and high-performance framework for parallel spatial hashing on GPU.
ASH achieves higher performance, supports richer functionality, and requires fewer lines of code.
ASH and its example applications are open sourced in Open3D.
arXiv Detail & Related papers (2021-10-01T16:25:40Z) - GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
This paper proposes a Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z) - High-Performance Training by Exploiting Hot-Embeddings in Recommendation
Systems [2.708848417398231]
Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications.
These models use massive embedding tables to store numerical representations of items' and users' categorical variables.
Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU.
This paper tries to leverage skewed embedding table accesses to efficiently use the GPU resources during training.
arXiv Detail & Related papers (2021-03-01T01:43:26Z)