Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
- URL: http://arxiv.org/abs/2011.02084v2
- Date: Wed, 11 Nov 2020 16:31:05 GMT
- Title: Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
- Authors: Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh
Tsai, Carole-Jean Wu, and Mark Hempstead
- Abstract summary: This work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure.
We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution.
Even more encouragingly, we show how distributed inference can account for efficiency improvements in data-center scale recommendation serving.
- Score: 1.9529164002361878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models have grown to the terabyte scale.
Traditional serving schemes--which load an entire model onto a single
server--cannot support this scale. One approach is distributed serving, or
distributed inference, which divides the memory requirements of a single
large model across multiple servers.
This work is a first step for the systems research community to develop novel
model-serving solutions, given the huge system design space. Large-scale deep
recommender systems are a novel workload and vital to study, as they consume up
to 79% of all inference cycles in the data center. To that end, this work
describes and characterizes scale-out deep learning recommendation inference
using data-center serving infrastructure. This work specifically explores
latency-bounded inference systems, compared to the throughput-oriented training
systems of other recent works. We find that the latency and compute overheads
of distributed inference are largely a result of a model's static embedding
table distribution and sparsity of input inference requests. We further
evaluate three embedding table mapping strategies across three DLRM-like models and
specify challenging design trade-offs in terms of end-to-end latency, compute
overhead, and resource efficiency. Overall, we observe only a marginal latency
overhead when data-center scale recommendation models are served with
distributed inference--P99 latency increases by only 1% in the best
case configuration. The latency overheads are largely a result of the commodity
infrastructure used and the sparsity of embedding tables. Even more
encouragingly, we also show how distributed inference can account for
efficiency improvements in data-center scale recommendation serving.
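To make the capacity-driven placement and the distributed lookup path concrete, the sketch below is a minimal illustration, not the paper's implementation: it greedily assigns embedding tables to inference servers under a per-server memory budget (one possible static, table-wise mapping) and splits a single request's sparse IDs into per-server sub-requests, which is where the extra network hop and its latency overhead come from. All table names, sizes, capacities, and the first-fit-decreasing heuristic are assumptions chosen for illustration.

```python
# Hedged sketch of capacity-driven, table-wise embedding placement plus a
# simulated scatter of one inference request across shard servers.
# Everything below (names, sizes, the greedy heuristic) is illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

BYTES_PER_FLOAT = 4


@dataclass
class EmbeddingTable:
    name: str
    num_rows: int
    dim: int

    @property
    def size_bytes(self) -> int:
        return self.num_rows * self.dim * BYTES_PER_FLOAT


@dataclass
class Server:
    server_id: int
    capacity_bytes: int
    tables: List[EmbeddingTable] = field(default_factory=list)

    @property
    def used_bytes(self) -> int:
        return sum(t.size_bytes for t in self.tables)


def place_tables(tables: List[EmbeddingTable],
                 servers: List[Server]) -> Dict[str, int]:
    """Greedy first-fit-decreasing table-wise mapping: largest tables first,
    each assigned to the least-loaded server with enough remaining capacity.
    This stands in for the static embedding-table distribution the paper
    characterizes; the actual mapping strategies evaluated may differ."""
    mapping: Dict[str, int] = {}
    for table in sorted(tables, key=lambda t: t.size_bytes, reverse=True):
        candidates = [s for s in servers
                      if s.capacity_bytes - s.used_bytes >= table.size_bytes]
        if not candidates:
            raise RuntimeError(f"no server can hold table {table.name}")
        target = min(candidates, key=lambda s: s.used_bytes)
        target.tables.append(table)
        mapping[table.name] = target.server_id
    return mapping


def distributed_lookup(request: Dict[str, List[int]],
                       mapping: Dict[str, int]) -> Dict[int, Dict[str, List[int]]]:
    """Split one request's sparse IDs into per-server sub-requests. In a real
    deployment each sub-request is an RPC; this fan-out and gather is the main
    source of the latency overhead discussed in the abstract."""
    per_server: Dict[int, Dict[str, List[int]]] = {}
    for table_name, ids in request.items():
        server_id = mapping[table_name]
        per_server.setdefault(server_id, {})[table_name] = ids
    return per_server


if __name__ == "__main__":
    tables = [EmbeddingTable("user_id", 50_000_000, 64),
              EmbeddingTable("item_id", 20_000_000, 64),
              EmbeddingTable("category", 100_000, 32)]
    servers = [Server(0, 16 * 2**30), Server(1, 16 * 2**30)]
    mapping = place_tables(tables, servers)
    request = {"user_id": [12, 87], "item_id": [3, 3, 9], "category": [5]}
    print(mapping)                                # which server hosts each table
    print(distributed_lookup(request, mapping))   # per-server request fan-out
```

A row-wise or column-wise sharding strategy would replace place_tables with a mapping from row or column ranges (rather than whole tables) to servers, trading lower per-server memory and better load balance for a wider fan-out per request.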
Related papers
- A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning.
Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation.
We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Complement Sparsification: Low-Overhead Model Pruning for Federated Learning [2.0428960719376166]
Federated Learning (FL) is a privacy-preserving distributed deep learning paradigm that involves substantial communication and computation effort.
Existing model pruning/sparsification solutions cannot satisfy the requirements for low bidirectional communication overhead between the server and the clients.
We propose Complement Sparsification (CS), a pruning mechanism that satisfies all these requirements through a complementary and collaborative pruning done at the server and the clients.
arXiv Detail & Related papers (2023-03-10T23:07:02Z)
- Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z)
- A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data.
Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z)
- FedNet2Net: Saving Communication and Computations in Federated Learning with Model Growing [0.0]
Federated learning (FL) is a recently developed area of machine learning.
In this paper, a novel scheme based on the notion of "model growing" is proposed.
The proposed approach is tested extensively on three standard benchmarks and is shown to achieve substantial reduction in communication and client computation.
arXiv Detail & Related papers (2022-07-19T21:54:53Z)
- An Expectation-Maximization Perspective on Federated Learning [75.67515842938299]
Federated learning describes the distributed training of models across multiple clients while keeping the data private on-device.
In this work, we view the server-orchestrated federated learning process as a hierarchical latent variable model where the server provides the parameters of a prior distribution over the client-specific model parameters.
We show that with simple Gaussian priors and a hard version of the well known Expectation-Maximization (EM) algorithm, learning in such a model corresponds to FedAvg, the most popular algorithm for the federated learning setting.
arXiv Detail & Related papers (2021-11-19T12:58:59Z)
- A Bayesian Federated Learning Framework with Online Laplace Approximation [144.7345013348257]
Federated learning allows multiple clients to collaboratively learn a globally shared model.
We propose a novel FL framework that uses online Laplace approximation to approximate posteriors on both the client and server side.
We achieve state-of-the-art results on several benchmarks, clearly demonstrating the advantages of the proposed method.
arXiv Detail & Related papers (2021-02-03T08:36:58Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)