A GPU-specialized Inference Parameter Server for Large-Scale Deep
Recommendation Models
- URL: http://arxiv.org/abs/2210.08804v1
- Date: Mon, 17 Oct 2022 07:36:18 GMT
- Authors: Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Kingsley Liu, Jerry
Shi and Joey Wang
- Score: 6.823233135936128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recommendation systems are of crucial importance for a variety of modern apps
and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep
learning with terabyte-scale embedding tables to obtain a fine-grained
representation of the underlying data. Traditional inference serving
architectures require deploying the whole model to standalone servers, which is
infeasible at such massive scale.
In this paper, we provide insights into the intriguing and challenging
inference domain of online recommendation systems. We propose the HugeCTR
Hierarchical Parameter Server (HPS), an industry-leading distributed
recommendation inference framework that combines a high-performance GPU
embedding cache with a hierarchical storage architecture to realize
low-latency retrieval of embeddings for online model inference tasks. Among
other things, HPS features (1) a redundant hierarchical storage system, (2) a
novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA
GPUs, (3) online training support and (4) light-weight APIs for easy
integration into existing large-scale recommendation workflows. To demonstrate
its capabilities, we conduct extensive studies using both synthetically
engineered and public datasets. We show that our HPS can dramatically reduce
end-to-end inference latency, achieving a 5~62x speedup (depending on the batch
size) over CPU baseline implementations for popular recommendation models.
Through multi-GPU concurrent deployment, the HPS can also greatly increase the
inference QPS.
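To make the hierarchical storage idea concrete, here is a minimal Python sketch of a three-tier lookup path, assuming an LRU-managed "GPU" cache backed by host memory and a full parameter table; the class and method names are illustrative assumptions and do not reflect the actual HugeCTR HPS API.

```python
# A minimal sketch of the hierarchical lookup idea behind HPS, assuming a
# three-tier layout (GPU embedding cache -> host memory -> backing store).
# All class/method names here are illustrative, NOT the actual HugeCTR HPS API.
from collections import OrderedDict

import numpy as np


class HierarchicalEmbeddingStore:
    """Toy three-tier store: hot embeddings live in a small LRU cache
    (standing in for the GPU embedding cache), warm ones in host memory,
    and cold ones are fetched from the full table (standing in for SSD/DB)."""

    def __init__(self, full_table, gpu_cache_slots):
        self.full_table = full_table          # tier 3: backing parameter store
        self.host_memory = {}                 # tier 2: CPU memory staging area
        self.gpu_cache = OrderedDict()        # tier 1: LRU "GPU" cache
        self.gpu_cache_slots = gpu_cache_slots

    def _cache_insert(self, key, vec):
        # Evict the least-recently-used entry when the cache is full.
        if len(self.gpu_cache) >= self.gpu_cache_slots:
            self.gpu_cache.popitem(last=False)
        self.gpu_cache[key] = vec

    def lookup(self, keys):
        """Gather embedding vectors for a batch of keys, promoting misses
        up the hierarchy so repeated keys become cache hits."""
        out = np.empty((len(keys), self.full_table.shape[1]),
                       dtype=self.full_table.dtype)
        for i, k in enumerate(keys):
            if k in self.gpu_cache:            # tier-1 hit
                self.gpu_cache.move_to_end(k)  # refresh LRU recency
                vec = self.gpu_cache[k]
            elif k in self.host_memory:        # tier-2 hit: promote to cache
                vec = self.host_memory[k]
                self._cache_insert(k, vec)
            else:                              # tier-3 miss path
                vec = self.full_table[k]
                self.host_memory[k] = vec      # stage in host memory
                self._cache_insert(k, vec)
            out[i] = vec
        return out


# Example: a 100k-row, 16-dimensional table with a tiny 4-slot "GPU" cache.
table = np.random.rand(100_000, 16).astype(np.float32)
store = HierarchicalEmbeddingStore(table, gpu_cache_slots=4)
batch = store.lookup([42, 7, 42, 99_999])  # the repeated key 42 hits the cache
print(batch.shape)  # -> (4, 16)
```

In the real HPS, the first tier is a GPU-resident cache queried by parallel lookup kernels and the lower tiers are distributed CPU-memory and SSD backends; the sketch above only mirrors the tiering and miss-promotion logic.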
Related papers
- Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs [13.720423381263409]
We show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to a 3.2x embedding-only performance slowdown.
We propose specialized plug-and-play software prefetching and L2 pinning techniques, which help hide and reduce these latencies.
Our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
arXiv Detail & Related papers (2024-10-29T17:13:54Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse neural network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching [59.8522166385372]
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs.
This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings.
We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data.
arXiv Detail & Related papers (2023-05-04T21:04:01Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
This approach, however, does not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40x speedup in time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
- Generalized Latency Performance Estimation for Once-For-All Neural Architecture Search [0.0]
We introduce two generalizability strategies, which include fine-tuning using a base model trained on a specific hardware and NAS search space.
We provide a family of latency prediction models that achieve over 50% lower RMSE loss compared to ProxylessNAS.
arXiv Detail & Related papers (2021-01-04T00:48:09Z)
- Understanding Capacity-Driven Scale-Out Neural Recommendation Inference [1.9529164002361878]
This work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure.
We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution.
Even more encouragingly, we show how distributed inference can yield efficiency improvements in data-center-scale recommendation serving.
arXiv Detail & Related papers (2020-11-04T00:51:40Z)
- A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed CpRec, in which two generic model-shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4~8x compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)