Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
- URL: http://arxiv.org/abs/2011.02084v2
- Date: Wed, 11 Nov 2020 16:31:05 GMT
- Title: Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
- Authors: Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh
Tsai, Carole-Jean Wu, and Mark Hempstead
- Abstract summary: This work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure.
We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution.
Even more encouragingly, we show how distributed inference can account for efficiency improvements in data-center scale recommendation serving.
- Score: 1.9529164002361878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models have grown to the terabyte scale.
Traditional serving schemes--which load an entire model onto a single
server--cannot support this scale. One approach is distributed serving, or
distributed inference, which divides the memory requirements of a single
large model across multiple servers.
This work is a first step for the systems research community to develop novel
model-serving solutions, given the huge system design space. Large-scale deep
recommender systems are a novel workload and vital to study, as they consume up
to 79% of all inference cycles in the data center. To that end, this work
describes and characterizes scale-out deep learning recommendation inference
using data-center serving infrastructure. This work specifically explores
latency-bounded inference systems, compared to the throughput-oriented training
systems of other recent works. We find that the latency and compute overheads
of distributed inference are largely a result of a model's static embedding
table distribution and sparsity of input inference requests. We further
evaluate three embedding table mapping strategies across three DLRM-like models and
specify challenging design trade-offs in terms of end-to-end latency, compute
overhead, and resource efficiency. Overall, we observe only a marginal latency
overhead when data-center scale recommendation models are served with
distributed inference--P99 latency increases by only 1% in the best
case configuration. The latency overheads are largely a result of the commodity
infrastructure used and the sparsity of embedding tables. Even more
encouragingly, we also show how distributed inference can account for
efficiency improvements in data-center scale recommendation serving.
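To make the capacity-driven placement and the distributed lookup path concrete, the sketch below is a minimal illustration, not the paper's implementation: it greedily assigns embedding tables to inference servers under a per-server memory budget (one possible static, table-wise mapping) and splits a single request's sparse IDs into per-server sub-requests, which is where the extra network hop and its latency overhead come from. All table names, sizes, capacities, and the first-fit-decreasing heuristic are assumptions chosen for illustration.

```python
# Hedged sketch of capacity-driven, table-wise embedding placement plus a
# simulated scatter of one inference request across shard servers.
# Everything below (names, sizes, the greedy heuristic) is illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

BYTES_PER_FLOAT = 4


@dataclass
class EmbeddingTable:
    name: str
    num_rows: int
    dim: int

    @property
    def size_bytes(self) -> int:
        return self.num_rows * self.dim * BYTES_PER_FLOAT


@dataclass
class Server:
    server_id: int
    capacity_bytes: int
    tables: List[EmbeddingTable] = field(default_factory=list)

    @property
    def used_bytes(self) -> int:
        return sum(t.size_bytes for t in self.tables)


def place_tables(tables: List[EmbeddingTable],
                 servers: List[Server]) -> Dict[str, int]:
    """Greedy first-fit-decreasing table-wise mapping: largest tables first,
    each assigned to the least-loaded server with enough remaining capacity.
    This stands in for the static embedding-table distribution the paper
    characterizes; the actual mapping strategies evaluated may differ."""
    mapping: Dict[str, int] = {}
    for table in sorted(tables, key=lambda t: t.size_bytes, reverse=True):
        candidates = [s for s in servers
                      if s.capacity_bytes - s.used_bytes >= table.size_bytes]
        if not candidates:
            raise RuntimeError(f"no server can hold table {table.name}")
        target = min(candidates, key=lambda s: s.used_bytes)
        target.tables.append(table)
        mapping[table.name] = target.server_id
    return mapping


def distributed_lookup(request: Dict[str, List[int]],
                       mapping: Dict[str, int]) -> Dict[int, Dict[str, List[int]]]:
    """Split one request's sparse IDs into per-server sub-requests. In a real
    deployment each sub-request is an RPC; this fan-out and gather is the main
    source of the latency overhead discussed in the abstract."""
    per_server: Dict[int, Dict[str, List[int]]] = {}
    for table_name, ids in request.items():
        server_id = mapping[table_name]
        per_server.setdefault(server_id, {})[table_name] = ids
    return per_server


if __name__ == "__main__":
    tables = [EmbeddingTable("user_id", 50_000_000, 64),
              EmbeddingTable("item_id", 20_000_000, 64),
              EmbeddingTable("category", 100_000, 32)]
    servers = [Server(0, 16 * 2**30), Server(1, 16 * 2**30)]
    mapping = place_tables(tables, servers)
    request = {"user_id": [12, 87], "item_id": [3, 3, 9], "category": [5]}
    print(mapping)                                # which server hosts each table
    print(distributed_lookup(request, mapping))   # per-server request fan-out
```

A row-wise or column-wise sharding strategy would replace place_tables with a mapping from row or column ranges (rather than whole tables) to servers, trading lower per-server memory and better load balance for a wider fan-out per request.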
Related papers
- A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning.
Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation.
We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Complement Sparsification: Low-Overhead Model Pruning for Federated Learning [2.0428960719376166]
Federated Learning (FL) is a privacy-preserving distributed deep learning paradigm that involves substantial communication and computation effort.
Existing model pruning/sparsification solutions cannot satisfy the requirements for low bidirectional communication overhead between the server and the clients.
We propose Complement Sparsification (CS), a pruning mechanism that satisfies all these requirements through a complementary and collaborative pruning done at the server and the clients.
arXiv Detail & Related papers (2023-03-10T23:07:02Z)
- Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z)
- A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data.
Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z)
- FedNet2Net: Saving Communication and Computations in Federated Learning with Model Growing [0.0]
Federated learning (FL) is a recently developed area of machine learning.
In this paper, a novel scheme based on the notion of "model growing" is proposed.
The proposed approach is tested extensively on three standard benchmarks and is shown to achieve substantial reduction in communication and client computation.
arXiv Detail & Related papers (2022-07-19T21:54:53Z)
- An Expectation-Maximization Perspective on Federated Learning [75.67515842938299]
Federated learning describes the distributed training of models across multiple clients while keeping the data private on-device.
In this work, we view the server-orchestrated federated learning process as a hierarchical latent variable model where the server provides the parameters of a prior distribution over the client-specific model parameters.
We show that with simple Gaussian priors and a hard version of the well known Expectation-Maximization (EM) algorithm, learning in such a model corresponds to FedAvg, the most popular algorithm for the federated learning setting.
arXiv Detail & Related papers (2021-11-19T12:58:59Z)
- A Bayesian Federated Learning Framework with Online Laplace Approximation [144.7345013348257]
Federated learning allows multiple clients to collaboratively learn a globally shared model.
We propose a novel FL framework that uses online Laplace approximation to approximate posteriors on both the client and server side.
We achieve state-of-the-art results on several benchmarks, clearly demonstrating the advantages of the proposed method.
arXiv Detail & Related papers (2021-02-03T08:36:58Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)