PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
- URL: http://arxiv.org/abs/2406.14571v1
- Date: Tue, 11 Jun 2024 05:26:45 GMT
- Title: PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
- Authors: Yunjae Lee, Hyeseong Kim, Minsoo Rhu
- Abstract summary: PreSto is a storage-centric preprocessing system leveraging In-Storage Processing (ISP)
We show that PreSto outperforms the baseline CPU-centric system with a $9.6\times$ speedup in end-to-end preprocessing time.
- Score: 3.781822234460176
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Training recommendation systems (RecSys) faces several challenges as it requires the "data preprocessing" stage to preprocess an ample amount of raw data and feed it to the GPU for training in a seamless manner. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing, which incurs substantial deployment cost and power consumption. Our characterization reveals that prior CPU-centric preprocessing is bottlenecked on feature generation and feature normalization operations as it fails to exploit the abundant inter-/intra-feature parallelism in RecSys preprocessing. PreSto is a storage-centric preprocessing system leveraging In-Storage Processing (ISP), which offloads the bottlenecked preprocessing operations to our ISP units. We show that PreSto outperforms the baseline CPU-centric system with a $9.6\times$ speedup in end-to-end preprocessing time, $4.3\times$ enhancement in cost-efficiency, and $11.3\times$ improvement in energy efficiency on average for production-scale RecSys preprocessing.
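For context on the two bottlenecked stages named above, the sketch below illustrates what typical RecSys feature generation (hashing categorical ids into embedding-table indices) and feature normalization (log-scaling heavy-tailed counters) look like. The column names, vocabulary size, and helper names are hypothetical; this shows the general preprocessing pattern, not PreSto's actual ISP implementation.

```python
import numpy as np

def generate_features(raw_counts, raw_ids, vocab_size=1_000_000):
    """Feature generation: turn raw log columns into model inputs.

    raw_counts: continuous counters per example (e.g., click counts).
    raw_ids:    categorical strings per example (e.g., item ids).
    Each feature column can be processed independently, which is the
    inter-feature parallelism the abstract refers to.
    """
    dense = np.asarray(raw_counts, dtype=np.float32)
    # Hash categorical strings into a bounded embedding-table index space.
    sparse = np.array([hash(s) % vocab_size for s in raw_ids], dtype=np.int64)
    return dense, sparse

def normalize_features(dense):
    """Feature normalization: squash heavy-tailed counters with log(1 + x)."""
    return np.log1p(np.maximum(dense, 0.0))

# One mini-batch of raw records (hypothetical values).
dense, sparse = generate_features([3, 0, 17], ["item_42", "item_7", "item_42"])
dense = normalize_features(dense)  # dense and sparse are then fed to the GPU trainer
```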
Related papers
- Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems [9.23424733090734]
Online data preprocessing plays an increasingly important role in serving recommender systems.
Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node.
We introduce Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion.
arXiv Detail & Related papers (2025-01-21T10:53:17Z)
- Efficient Tabular Data Preprocessing of ML Pipelines [9.23424733090734]
Data preprocessing pipelines are a crucial component of Machine Learning (ML) training.
Piper is a hardware accelerator for data preprocessing; we prototype it on FPGAs and demonstrate its potential for the training pipelines of commercial recommender systems.
Piper achieves a 4.7$\sim$71.3$\times$ speedup in latency over a 128-core CPU server and outperforms a data-center GPU by 4.8$\sim$20.3$\times$ when using binary input.
arXiv Detail & Related papers (2024-09-23T11:07:57Z)
- Rethinking Resource Management in Edge Learning: A Joint Pre-training and Fine-tuning Design Paradigm [87.47506806135746]
In some applications, edge learning is shifting its focus from conventional learning from scratch to a new two-stage paradigm of pre-training followed by task-specific fine-tuning.
This paper considers the problem of joint communication and computation resource management in a two-stage edge learning system.
It is shown that the proposed joint resource management over the pre-training and fine-tuning stages well balances the system performance trade-off.
arXiv Detail & Related papers (2024-04-01T00:21:11Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language model serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% in monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files.
We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- Understanding and Co-designing the Data Ingestion Pipeline for Industry-Scale RecSys Training [5.058493679956239]
We present an extensive characterization of the data ingestion challenges for industry-scale recommendation model training.
First, dataset storage requirements are massive and variable, exceeding local storage capacities.
Second, reading and preprocessing data is computationally expensive, requiring substantially more compute, memory, and network resources than are available on the trainers themselves.
We introduce Data PreProcessing Service (DPP), a fully disaggregated preprocessing service that scales to hundreds of nodes, eliminating data stalls that can reduce training throughput by 56%.
arXiv Detail & Related papers (2021-08-20T21:09:34Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC, and Spark require far more computational resources and incur higher cost than conventional online services.
We propose a predictive autoscaler that provides an elastic interface for customers and overprovisions instances based on predicted resource demand.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
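As referenced in the "Stochastic Optimization with Laggard Data Pipelines" entry above, data echoing reuses each preprocessed minibatch for several optimizer steps so the accelerator is not starved while a slow data pipeline prepares the next batch. The sketch below is a minimal illustration of that idea under assumed names (pipeline, step_fn, echo_factor); it is not that paper's algorithm or analysis.

```python
def train_with_data_echoing(pipeline, step_fn, echo_factor=4, max_steps=1000):
    """Run `step_fn` on each batch `echo_factor` times before fetching the next.

    pipeline:    iterable yielding preprocessed minibatches (the slow stage).
    step_fn:     one optimizer step on a minibatch (the fast stage).
    echo_factor: how many times each batch is reused ("echoed").
    """
    steps = 0
    for batch in pipeline:              # blocked on I/O + preprocessing
        for _ in range(echo_factor):    # cheap repeated optimizer steps
            step_fn(batch)
            steps += 1
            if steps >= max_steps:
                return steps
    return steps
```

Larger echo factors raise hardware utilization but reuse stale data; the cited paper's contribution is characterizing when this trade-off still preserves the optimal statistical rate.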