Understanding and Co-designing the Data Ingestion Pipeline for
Industry-Scale RecSys Training
- URL: http://arxiv.org/abs/2108.09373v1
- Date: Fri, 20 Aug 2021 21:09:34 GMT
- Title: Understanding and Co-designing the Data Ingestion Pipeline for
Industry-Scale RecSys Training
- Authors: Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan,
Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu,
Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean
Wu, Christos Kozyrakis, Parik Pol
- Abstract summary: We present an extensive characterization of the data ingestion challenges for industry-scale recommendation model training.
First, dataset storage requirements are massive and variable, exceeding local storage capacities.
Second, reading and preprocessing data is computationally expensive, requiring substantially more compute, memory, and network resources than are available on trainers themselves.
We introduce Data PreProcessing Service (DPP), a fully disaggregated preprocessing service that scales to hundreds of nodes, eliminating data stalls that can reduce training throughput by 56%.
- Score: 5.058493679956239
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The data ingestion pipeline, responsible for storing and preprocessing
training data, is an important component of any machine learning training job.
At Facebook, we use recommendation models extensively across our services. The
data ingestion requirements to train these models are substantial. In this
paper, we present an extensive characterization of the data ingestion
challenges for industry-scale recommendation model training. First, dataset
storage requirements are massive and variable, exceeding local storage
capacities. Second, reading and preprocessing data is computationally
expensive, requiring substantially more compute, memory, and network resources
than are available on trainers themselves. These demands result in drastically
reduced training throughput, and thus wasted GPU resources, when current
on-trainer preprocessing solutions are used. To address these challenges, we
present a disaggregated data ingestion pipeline. It includes a central data
warehouse built on distributed storage nodes. We introduce Data PreProcessing
Service (DPP), a fully disaggregated preprocessing service that scales to
hundreds of nodes, eliminating data stalls that can reduce training throughput
by 56%. We implement important optimizations across storage and DPP, increasing
storage and preprocessing throughput by 1.9x and 2.3x, respectively, addressing
the substantial power requirements of data ingestion. We close with lessons
learned and cover the important remaining challenges and opportunities
surrounding data ingestion at scale.
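DPP itself is not public, but the disaggregation pattern the abstract describes, a separately scaled pool of preprocessing workers that stream training-ready batches to the trainers, can be sketched roughly as below. All names and the multiprocessing transport are hypothetical stand-ins, not Facebook's implementation.

```python
# Rough sketch of disaggregated preprocessing (hypothetical, not the paper's DPP):
# preprocessing workers run outside the trainer and stream ready batches to it,
# so the training loop does not stall on CPU-heavy feature extraction.
import multiprocessing as mp
import time

def preprocess_worker(raw_shard, batch_queue):
    """Hypothetical DPP-style worker: turns raw rows into training-ready batches."""
    for rows in raw_shard:
        batch = [x * 2 for x in rows]  # stand-in for decompression/feature transforms
        batch_queue.put(batch)
    batch_queue.put(None)  # tell the trainer this shard is finished

def trainer_loop(batch_queue, num_workers):
    """Consumes batches; in a real job this is the GPU forward/backward step."""
    done = 0
    while done < num_workers:
        batch = batch_queue.get()
        if batch is None:
            done += 1
            continue
        time.sleep(0.01)  # stand-in for a training step
        print(f"trained on a batch of {len(batch)} samples")

if __name__ == "__main__":
    shards = [[list(range(i, i + 8)) for i in range(0, 32, 8)] for _ in range(2)]
    queue = mp.Queue(maxsize=16)  # bounded buffer decouples the two tiers
    workers = [mp.Process(target=preprocess_worker, args=(s, queue)) for s in shards]
    for w in workers:
        w.start()
    trainer_loop(queue, num_workers=len(workers))
    for w in workers:
        w.join()
```

Scaling the worker pool independently of the trainers is what lets the pipeline absorb the compute, memory, and network demands the abstract describes.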
Related papers
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- RINAS: Training with Dataset Shuffling Can Be General and Fast [2.485503195398027]
RINAS is a data loading framework that addresses the performance bottleneck of loading global shuffled datasets.
We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
arXiv Detail & Related papers (2023-12-04T21:50:08Z)
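For context, the sketch below shows the conventional globally shuffled PyTorch DataLoader setup whose loading cost RINAS targets. The toy dataset and all parameters are illustrative; this is the standard baseline, not the RINAS implementation.

```python
# Baseline globally shuffled loading in PyTorch (illustrative; not RINAS itself).
import torch
from torch.utils.data import DataLoader, Dataset

class ToyTextDataset(Dataset):
    """Stand-in for a HuggingFace/TorchVision-backed dataset."""
    def __init__(self, num_samples=10_000, seq_len=16):
        self.data = torch.randint(0, 30_000, (num_samples, seq_len))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # With a global shuffle, idx jumps randomly over the whole dataset,
        # which defeats sequential I/O when samples actually live on disk.
        return self.data[idx]

if __name__ == "__main__":
    loader = DataLoader(
        ToyTextDataset(),
        batch_size=256,
        shuffle=True,    # global shuffle of sample indices every epoch
        num_workers=4,   # parallel fetch/preprocess workers
        pin_memory=True,
    )
    for batch in loader:
        pass  # training step goes here
```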
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files.
We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
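A simple way to check for the kind of hidden trade-off this paper studies is to time how long each step waits on the input pipeline versus how long it spends in the model. The helper below is a generic profiling sketch, not the paper's tooling; with CUDA, call torch.cuda.synchronize() before reading the clock.

```python
# Generic sketch for locating an input-pipeline bottleneck: accumulate time spent
# waiting on data versus time spent in the training step.
import time

def profile_epoch(loader, train_step):
    data_time = step_time = 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # stalls here if preprocessing cannot keep up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)         # forward/backward/optimizer in a real job
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    print(f"waiting on data: {data_time:.2f}s, in training step: {step_time:.2f}s")

if __name__ == "__main__":
    fake_loader = (list(range(32)) for _ in range(200))  # pretend batches
    profile_epoch(fake_loader, train_step=lambda b: time.sleep(0.001))
```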
- Importance of Data Loading Pipeline in Training Deep Neural Networks [2.127049691404299]
In large models, the time spent loading data accounts for a significant portion of model training time.
We evaluate a binary data format to accelerate data reading and NVIDIA DALI to accelerate data augmentation.
Our study shows improvements on the order of 20% to 40% when such dedicated tools are used.
arXiv Detail & Related papers (2020-04-21T14:19:48Z)
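As a rough illustration of why a packed binary format speeds up data reading (the specific format and the DALI pipeline used in that study are not reproduced here), one can write all samples into a single contiguous file and memory-map it, instead of opening many small per-sample files each epoch:

```python
# Illustrative only: packing samples into one binary file and memory-mapping it,
# versus reading thousands of small per-sample files every epoch.
import numpy as np

# One-time packing step: write the whole dataset as a contiguous float32 blob.
samples = np.random.rand(10_000, 128).astype(np.float32)  # pretend dataset
samples.tofile("train.bin")

# Training-time reading: memory-map the blob and slice out batches; reads are
# sequential and served from the OS page cache after the first pass.
data = np.memmap("train.bin", dtype=np.float32, mode="r").reshape(-1, 128)
for start in range(0, len(data), 256):
    batch = np.array(data[start:start + 256])  # materialize the batch in RAM
    # ... hand the batch to the training step ...
```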
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.