Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems
- URL: http://arxiv.org/abs/2501.12032v2
- Date: Fri, 24 Jan 2025 08:51:54 GMT
- Title: Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems
- Authors: Yu Zhu, Wenqi Jiang, Gustavo Alonso
- Abstract summary: Online data preprocessing plays an increasingly important role in serving recommender systems.
Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node.
We introduce Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion.
- Abstract: Keeping ML-based recommender models up-to-date as data drifts and evolves is essential to maintaining accuracy. As a result, online data preprocessing plays an increasingly important role in serving recommender systems. Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node. Such an approach results in high deployment costs and energy consumption. For instance, a recent report from industrial deployments shows that data storage and ingestion pipelines can account for over 60% of the power consumption in a recommender system. In this paper, we tackle the issue from a hardware perspective by introducing Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion. As part of the design, we define MiniPipe, the smallest pipeline unit, which enables multi-pipeline implementation by executing various data preprocessing tasks across a single board and gives Piper the ability to be reconfigured at runtime. Our results, using publicly released commercial pipelines, show that Piper, prototyped on a power-efficient FPGA, achieves a 39-105x speedup over a server-grade, 128-core CPU and a 3-17x speedup over GPUs such as the RTX 3090 and A100 across multiple pipelines. The experimental analysis demonstrates that Piper offers advantages in both latency and energy efficiency for preprocessing tasks in recommender systems, providing an alternative design point for systems that are in very high demand today.
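The MiniPipe idea, small composable pipeline units processed over a record stream, can be illustrated with a rough software sketch. All stage names and transformations below are illustrative, loosely modeled on common recommender preprocessing (filling missing values, log-transforming dense features, hashing categorical IDs); they are not Piper's actual operators.

```python
import math

# Illustrative streaming preprocessing stages for recommender features.
# Each generator is a software stand-in for a "MiniPipe": it consumes
# the previous stage's stream, so records flow through one at a time
# without materializing the whole dataset.

def fill_missing(records, default=0):
    for rec in records:
        yield {k: (default if v is None else v) for k, v in rec.items()}

def log_dense(records, keys):
    # log1p is a common transform for heavy-tailed dense features.
    for rec in records:
        yield {k: (math.log1p(v) if k in keys else v) for k, v in rec.items()}

def hash_categorical(records, keys, table_size):
    # Map unbounded categorical IDs into a fixed-size embedding table.
    for rec in records:
        yield {k: (hash(v) % table_size if k in keys else v)
               for k, v in rec.items()}

def pipeline(records):
    # Compose the stages; reordering or swapping stages here is the
    # software analogy of reconfiguring pipeline units at runtime.
    stream = fill_missing(records)
    stream = log_dense(stream, keys={"dense_0"})
    stream = hash_categorical(stream, keys={"cat_0"}, table_size=1024)
    return stream

if __name__ == "__main__":
    raw = [{"dense_0": 3.0, "cat_0": "ad_42"},
           {"dense_0": None, "cat_0": "ad_7"}]
    for out in pipeline(raw):
        print(out)
```

On the FPGA, the point of streaming execution is that such stages run concurrently on records in flight rather than as sequential passes over buffered data; this sketch only mirrors the dataflow structure, not the hardware parallelism.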
Related papers
- Efficient Tabular Data Preprocessing of ML Pipelines
Data preprocessing pipelines are a crucial component of Machine Learning (ML) training.
Piper is a hardware accelerator for data preprocessing; the authors prototype it on FPGAs and demonstrate its potential for training pipelines of commercial recommender systems.
Piper achieves a 4.7-71.3x speedup in latency over a 128-core CPU server and outperforms a data-center GPU by 4.8-20.3x when using binary input.
arXiv Detail & Related papers (2024-09-23T11:07:57Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Efficient NLP Inference at the Edge via Elastic Pipelining
WRX reconciles the latency/memory tension via two novel techniques.
We build WRX and evaluate it against a range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU.
arXiv Detail & Related papers (2022-07-11T17:15:57Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
We propose Plumber, a tool for finding bottlenecks in Machine Learning (ML) input pipelines.
Across five representative ML pipelines, Plumber obtains speedups of up to 46x.
By automating caching, Plumber obtains end-to-end speedups of over 40% compared to state-of-the-art tuners.
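The core diagnosis idea, finding the stage that caps end-to-end throughput, can be shown with a toy calculation. Stage names and rates below are invented for illustration; Plumber's real analysis instruments actual ML input pipelines rather than taking rates as given.

```python
# Toy input-pipeline bottleneck analysis: model each stage by its
# standalone throughput (records/sec). In a linear streaming pipeline
# the end-to-end rate is bounded by the slowest stage, so that stage
# is the tuning target (e.g. parallelize it or cache its output).

def find_bottleneck(stage_rates):
    """Return (stage_name, rate) for the lowest-throughput stage."""
    return min(stage_rates.items(), key=lambda kv: kv[1])

if __name__ == "__main__":
    rates = {"read": 50_000, "decode": 8_000,
             "augment": 20_000, "batch": 60_000}
    stage, rate = find_bottleneck(rates)
    print(f"bottleneck: {stage} at {rate} rec/s")
```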
arXiv Detail & Related papers (2021-11-07T17:15:57Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
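The neighborhood-sampling idea summarized above can be sketched in a few lines. This is a minimal illustration of bounded-fan-out sampling for mini-batch GNN training, not the paper's performance-engineered sampler; the graph representation and fan-out values are assumptions.

```python
import random

def sample_neighborhood(adj, seeds, fanouts, rng=random.Random(0)):
    """Sample a multi-hop neighborhood for a mini-batch of seed nodes.

    adj: dict mapping node -> list of neighbor nodes
    fanouts: max neighbors kept per hop, e.g. [10, 5] for two hops
    Returns one sorted node list per hop (seed layer first).
    """
    layers = [sorted(seeds)]
    frontier = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            # Keep at most `fanout` random neighbors, so per-hop work
            # stays bounded regardless of a node's degree.
            k = min(fanout, len(neighbors))
            nxt.update(rng.sample(neighbors, k))
        layers.append(sorted(nxt))
        frontier = nxt
    return layers
```

Capping the fan-out per hop is what keeps computation and data movement per mini-batch bounded; the trade-off, as the summary notes for inference, is that sampled aggregation is an approximation of the full neighborhood.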
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers
PipeTransformer is a distributed training algorithm for Transformer models.
It automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training.
We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets.
arXiv Detail & Related papers (2021-02-05T13:39:31Z)
- Rethinking Learning-based Demosaicing, Denoising, and Super-Resolution Pipeline
We study the effects of pipeline ordering on the mixture problem of learning-based denoising (DN), demosaicing (DM), and super-resolution (SR), in both sequential and joint solutions.
Our suggested pipeline, DN→SR→DM, yields consistently better performance than other sequential pipelines.
We propose an end-to-end Trinity Pixel Enhancement NETwork (TENet) that achieves state-of-the-art performance for the mixture problem.
arXiv Detail & Related papers (2019-05-07T13:19:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.