Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
- URL: http://arxiv.org/abs/2410.22249v1
- Date: Tue, 29 Oct 2024 17:13:54 GMT
- Title: Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
- Authors: Rishabh Jain, Vivek M. Bhasi, Adwait Jog, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das
- Abstract summary: We show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, causing an embedding-only performance slowdown of up to 3.2x.
We propose specialized plug-and-play software prefetching and L2 pinning techniques, which help hide and reduce these latencies.
Our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
- Score: 13.720423381263409
- License:
- Abstract: Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to an embedding-only performance slowdown of up to 3.2x. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the low occupancy of the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, improving performance by up to 53%. Yet, long memory-latency stalls remain. To tackle this challenge, we propose specialized plug-and-play software prefetching and L2 pinning techniques, which help hide and reduce these latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage and up to 77% for the overall DLRM inference pipeline.
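The two remedies named in the abstract, software prefetching and L2 pinning, correspond to documented CUDA 11+ mechanisms on Ampere-class GPUs such as the A100. The sketch below is not the paper's implementation; it is a minimal illustration under assumed names (the `embedding_bag_gather` kernel, the `d_hot_rows` buffer, and the hit-ratio choice are all hypothetical), showing L2 pinning via a persisting access-policy window on a stream and software prefetching via an inline PTX `prefetch.global.L2` hint issued one embedding row ahead of the accumulation.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical embedding-bag gather: one block per output bag, threads stride
// over the embedding dimension. The next row is prefetched into L2 while the
// current row is accumulated (a software-prefetching sketch, not the paper's kernel).
__global__ void embedding_bag_gather(const float* __restrict__ table,
                                     const int*   __restrict__ indices,
                                     const int*   __restrict__ offsets,
                                     float*       __restrict__ out,
                                     int dim) {
    int bag   = blockIdx.x;
    int start = offsets[bag];
    int end   = offsets[bag + 1];
    for (int d = threadIdx.x; d < dim; d += blockDim.x) {
        float acc = 0.0f;
        for (int i = start; i < end; ++i) {
            if (i + 1 < end) {
                // Hint the next row's element into L2 ahead of its use.
                const float* next = table + (size_t)indices[i + 1] * dim + d;
                asm volatile("prefetch.global.L2 [%0];" :: "l"(next));
            }
            acc += table[(size_t)indices[i] * dim + d];
        }
        out[(size_t)bag * dim + d] = acc;
    }
}

// "L2 pinning" sketch: mark a buffer of hot embedding rows as persisting in L2
// via CUDA's access-policy window (compute capability 8.0+, e.g. A100).
void pin_hot_rows_in_l2(cudaStream_t stream, const float* d_hot_rows,
                        size_t hot_bytes) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);

    // Reserve the maximum persisting L2 carve-out the device allows.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);

    cudaStreamAttrValue attr{};
    attr.accessPolicyWindow.base_ptr  = const_cast<float*>(d_hot_rows);
    attr.accessPolicyWindow.num_bytes =
        std::min(hot_bytes, (size_t)prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f;  // try to keep every touched line resident
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

In practice the persisting window would be sized to cover the hottest embedding rows (identified offline or by access counters), and kernels launched on the same stream then see those rows pinned in L2; the paper's exact kernels, compiler flags, and row-selection policy are not reproduced here.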
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture [6.5386984667643695]
UpDLRM uses real-world processing-in-memory hardware, the UPMEM DPU, to boost memory bandwidth and reduce recommendation latency.
UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.
arXiv Detail & Related papers (2024-06-20T02:20:21Z) - Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which serve as standard formats for deploying models across different runtime infrastructures.
arXiv Detail & Related papers (2024-02-21T09:18:44Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data.
Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z) - A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across CPU and GPU memory space.
Our proposed software cache efficiently supports training entire DLRMs on the GPU with synchronized updates.
arXiv Detail & Related papers (2022-08-08T12:08:05Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models into MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose a Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z) - Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures [56.69373580921888]
We focus on Recommender Systems which account for most of the AI cycles in cloud computing centers.
By enabling it to run on the latest CPU hardware and software tailored for HPC, we are able to achieve more than two orders of magnitude improvement in performance.
arXiv Detail & Related papers (2020-05-10T14:40:16Z) - A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices.
Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data.
We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)