Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems
- URL: http://arxiv.org/abs/2501.12032v2
- Date: Fri, 24 Jan 2025 08:51:54 GMT
- Title: Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems
- Authors: Yu Zhu, Wenqi Jiang, Gustavo Alonso
- Abstract summary: Online data preprocessing plays an increasingly important role in serving recommender systems.
Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node.
We introduce Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion.
- Abstract: Keeping ML-based recommender models up-to-date as data drifts and evolves is essential to maintaining accuracy. As a result, online data preprocessing plays an increasingly important role in serving recommender systems. Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node. Such an approach results in high deployment costs and energy consumption. For instance, a recent report from industrial deployments shows that data storage and ingestion pipelines can account for over 60% of the power consumption in a recommender system. In this paper, we tackle the issue from a hardware perspective by introducing Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion. As part of the design, we define MiniPipe, the smallest pipeline unit, which enables multi-pipeline implementation by executing various data preprocessing tasks across a single board and gives Piper the ability to be reconfigured at runtime. Our results, using publicly released commercial pipelines, show that Piper, prototyped on a power-efficient FPGA, achieves a 39-105x speedup over a server-grade, 128-core CPU and a 3-17x speedup over GPUs such as the RTX 3090 and A100 across multiple pipelines. The experimental analysis demonstrates that Piper offers advantages in both latency and energy efficiency for preprocessing tasks in recommender systems, providing an alternative design point for systems that are in very high demand today.
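The MiniPipe idea, small composable pipeline units processed over a record stream, can be illustrated with a rough software sketch. All stage names and transformations below are illustrative, loosely modeled on common recommender preprocessing (filling missing values, log-transforming dense features, hashing categorical IDs); they are not Piper's actual operators.

```python
import math

# Illustrative streaming preprocessing stages for recommender features.
# Each generator is a software stand-in for a "MiniPipe": it consumes
# the previous stage's stream, so records flow through one at a time
# without materializing the whole dataset.

def fill_missing(records, default=0):
    for rec in records:
        yield {k: (default if v is None else v) for k, v in rec.items()}

def log_dense(records, keys):
    # log1p is a common transform for heavy-tailed dense features.
    for rec in records:
        yield {k: (math.log1p(v) if k in keys else v) for k, v in rec.items()}

def hash_categorical(records, keys, table_size):
    # Map unbounded categorical IDs into a fixed-size embedding table.
    for rec in records:
        yield {k: (hash(v) % table_size if k in keys else v)
               for k, v in rec.items()}

def pipeline(records):
    # Compose the stages; reordering or swapping stages here is the
    # software analogy of reconfiguring pipeline units at runtime.
    stream = fill_missing(records)
    stream = log_dense(stream, keys={"dense_0"})
    stream = hash_categorical(stream, keys={"cat_0"}, table_size=1024)
    return stream

if __name__ == "__main__":
    raw = [{"dense_0": 3.0, "cat_0": "ad_42"},
           {"dense_0": None, "cat_0": "ad_7"}]
    for out in pipeline(raw):
        print(out)
```

On the FPGA, the point of streaming execution is that such stages run concurrently on records in flight rather than as sequential passes over buffered data; this sketch only mirrors the dataflow structure, not the hardware parallelism.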
Related papers
- Efficient Tabular Data Preprocessing of ML Pipelines
Data preprocessing pipelines are a crucial component of Machine Learning (ML) training.
Piper is a hardware accelerator for data preprocessing; the authors prototype it on FPGAs and demonstrate its potential for training pipelines of commercial recommender systems.
Piper achieves a 4.7-71.3x speedup in latency over a 128-core CPU server and outperforms a data-center GPU by 4.8-20.3x when using binary input.
arXiv Detail & Related papers (2024-09-23T11:07:57Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Efficient NLP Inference at the Edge via Elastic Pipelining
WRX reconciles the latency/memory tension via two novel techniques.
We build WRX and evaluate it against a range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU.
arXiv Detail & Related papers (2022-07-11T17:15:57Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
We propose Plumber, a tool for finding bottlenecks in Machine Learning (ML) input pipelines.
Across five representative ML pipelines, Plumber obtains speedups of up to 46x.
By automating caching, Plumber obtains end-to-end speedups of over 40% compared to state-of-the-art tuners.
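The core diagnosis idea, finding the stage that caps end-to-end throughput, can be shown with a toy calculation. Stage names and rates below are invented for illustration; Plumber's real analysis instruments actual ML input pipelines rather than taking rates as given.

```python
# Toy input-pipeline bottleneck analysis: model each stage by its
# standalone throughput (records/sec). In a linear streaming pipeline
# the end-to-end rate is bounded by the slowest stage, so that stage
# is the tuning target (e.g. parallelize it or cache its output).

def find_bottleneck(stage_rates):
    """Return (stage_name, rate) for the lowest-throughput stage."""
    return min(stage_rates.items(), key=lambda kv: kv[1])

if __name__ == "__main__":
    rates = {"read": 50_000, "decode": 8_000,
             "augment": 20_000, "batch": 60_000}
    stage, rate = find_bottleneck(rates)
    print(f"bottleneck: {stage} at {rate} rec/s")
```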
arXiv Detail & Related papers (2021-11-07T17:15:57Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
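The neighborhood-sampling idea summarized above can be sketched in a few lines. This is a minimal illustration of bounded-fan-out sampling for mini-batch GNN training, not the paper's performance-engineered sampler; the graph representation and fan-out values are assumptions.

```python
import random

def sample_neighborhood(adj, seeds, fanouts, rng=random.Random(0)):
    """Sample a multi-hop neighborhood for a mini-batch of seed nodes.

    adj: dict mapping node -> list of neighbor nodes
    fanouts: max neighbors kept per hop, e.g. [10, 5] for two hops
    Returns one sorted node list per hop (seed layer first).
    """
    layers = [sorted(seeds)]
    frontier = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            # Keep at most `fanout` random neighbors, so per-hop work
            # stays bounded regardless of a node's degree.
            k = min(fanout, len(neighbors))
            nxt.update(rng.sample(neighbors, k))
        layers.append(sorted(nxt))
        frontier = nxt
    return layers
```

Capping the fan-out per hop is what keeps computation and data movement per mini-batch bounded; the trade-off, as the summary notes for inference, is that sampled aggregation is an approximation of the full neighborhood.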
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers
PipeTransformer is a distributed training algorithm for Transformer models.
It automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training.
We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets.
arXiv Detail & Related papers (2021-02-05T13:39:31Z)
- Rethinking Learning-based Demosaicing, Denoising, and Super-Resolution Pipeline
We study the effects of pipeline ordering on the mixture problem of learning-based denoising (DN), demosaicing (DM), and super-resolution (SR), in both sequential and joint solutions.
Our suggested pipeline, DN→SR→DM, yields consistently better performance than other sequential pipelines.
We propose an end-to-end Trinity Pixel Enhancement NETwork (TENet) that achieves state-of-the-art performance for the mixture problem.
arXiv Detail & Related papers (2019-05-07T13:19:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.