Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization
- URL: http://arxiv.org/abs/2507.01676v1
- Date: Wed, 02 Jul 2025 13:00:39 GMT
- Title: Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization
- Authors: Giuseppe Ruggeri, Renzo Andri, Daniele Jahier Pagliari, Lukas Cavigelli
- Abstract summary: Deep Recommender Models (DLRMs) inference accounts for more than 79% of the total AI workload in Meta's data centers. We propose the design of tailored data flows to speed up embedding look-ups.
- Score: 4.08734863805696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Recommender Models (DLRMs) inference is a fundamental AI workload accounting for more than 79% of the total AI workload in Meta's data centers. The performance bottleneck of DLRMs lies in the embedding layers, which perform many random memory accesses to retrieve small embedding vectors from tables of various sizes. We propose the design of tailored data flows to speed up embedding look-ups. Namely, we propose four strategies to look up an embedding table effectively on one core, and a framework to automatically map the tables asymmetrically to the multiple cores of a SoC. We assess the effectiveness of our method on Huawei Ascend AI accelerators, comparing it with the default Ascend compiler, and we perform high-level comparisons with the Nvidia A100. Results show speed-ups ranging from 1.5x to 6.5x for real workload distributions, and more than 20x for extremely unbalanced distributions. Furthermore, the method proves to be much more independent of the query distribution than the baseline.
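The abstract does not detail how the framework maps tables to cores; as an illustrative sketch only, asymmetric table-to-core assignment can be pictured as a greedy load-balancing heuristic (longest-processing-time scheduling). The cost model (one scalar lookup cost per table) and the function name are assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch: greedily assign embedding tables to SoC cores so that
# the estimated lookup cost per core stays balanced. Cost model and heuristic
# are illustrative assumptions, not the paper's published framework.
import heapq

def map_tables_to_cores(table_costs, num_cores):
    """table_costs: {table_id: estimated lookup cost}. Returns {core: [table_id, ...]}."""
    # Min-heap of (accumulated cost, core id): always give the next-heaviest
    # table to the currently least-loaded core (LPT scheduling).
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    for table, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)
        assignment[core].append(table)
        heapq.heappush(heap, (load + cost, core))
    return assignment
```

With skewed per-table costs, such a heuristic yields the asymmetric (unequal tables per core) mappings the abstract alludes to, rather than a symmetric round-robin split.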
Related papers
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The 1TB+ size of the embedding tables imposes a severe memory bottleneck on the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
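The DQRM summary above centers on shrinking embedding tables via quantization. As a hedged illustration of the general idea (DQRM itself targets lower bit-widths and its exact scheme is not given here), a row-wise 8-bit affine quantizer looks like this; all names are hypothetical:

```python
# Illustrative row-wise 8-bit embedding quantization: each fp32 row is stored
# as unsigned ints plus a per-row (scale, zero_point), cutting memory ~4x.
# This is a generic sketch, not DQRM's actual quantization scheme.

def quantize_row(row, bits=8):
    """Quantize one embedding row; returns (ints, scale, zero_point)."""
    lo, hi = min(row), max(row)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in row]
    return q, scale, lo

def dequantize_row(q, scale, zero_point):
    """Recover an approximate fp row at lookup time."""
    return [v * scale + zero_point for v in q]
```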
arXiv Detail & Related papers (2024-10-26T02:33:52Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition [24.71497121634708]
Varying-size models are often required to deploy ASR systems under different hardware and/or application constraints.
We present the dynamic encoder size approach, which jointly trains multiple performant models within one supernet from scratch.
arXiv Detail & Related papers (2024-07-10T08:35:21Z) - ASP: Automatic Selection of Proxy dataset for efficient AutoML [16.813109584129514]
We propose an Automatic Selection of Proxy dataset framework (ASP) to dynamically find the informative proxy subsets of training data at each epoch.
ASP can obtain better results than other data selection methods at all selection ratios.
arXiv Detail & Related papers (2023-10-17T09:36:22Z) - Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices and data centers.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z) - Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z) - Mem-Rec: Memory Efficient Recommendation System using Alternative Representation [6.542635536704625]
MEM-REC is a novel alternative representation approach for embedding tables.
We show that MEM-REC can not only maintain the recommendation quality but can also improve the embedding latency.
arXiv Detail & Related papers (2023-05-12T02:36:07Z) - BagPipe: Accelerating Deep Recommendation Model Training [9.911467752221863]
Bagpipe is a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation.
We design an Oracle Cacher, a new component that uses a lookahead algorithm to generate optimal cache update decisions.
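BagPipe's Oracle Cacher is described only as using a lookahead algorithm; a natural way to picture this is a Belady-style policy that, given the upcoming stream of embedding-id accesses, evicts the cached id whose next use lies farthest in the future. The sketch below is an assumed reconstruction of that idea, not BagPipe's actual implementation.

```python
# Hypothetical Belady-style lookahead eviction, in the spirit of BagPipe's
# Oracle Cacher: with the future access stream known (training batches are
# read ahead), evict the cached embedding id reused farthest away, or never.

def lookahead_evict(cache, future_accesses):
    """Return the id in `cache` to evict, given the known future access list."""
    next_use = {}
    for eid in cache:
        try:
            next_use[eid] = future_accesses.index(eid)
        except ValueError:
            next_use[eid] = float("inf")  # never used again: ideal victim
    return max(cache, key=lambda eid: next_use[eid])
```

Knowing the future stream is what makes the cache decisions "optimal" in the Belady sense; a training pipeline can afford this because upcoming batches are available before they are consumed.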
arXiv Detail & Related papers (2022-02-24T23:54:12Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.