Understanding and Co-designing the Data Ingestion Pipeline for
Industry-Scale RecSys Training
- URL: http://arxiv.org/abs/2108.09373v1
- Date: Fri, 20 Aug 2021 21:09:34 GMT
- Title: Understanding and Co-designing the Data Ingestion Pipeline for
Industry-Scale RecSys Training
- Authors: Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan,
Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu,
Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean
Wu, Christos Kozyrakis, Parik Pol
- Abstract summary: We present an extensive characterization of the data ingestion challenges for industry-scale recommendation model training.
First, dataset storage requirements are massive and variable, exceeding local storage capacities.
Second, reading and preprocessing data is computationally expensive, requiring substantially more compute, memory, and network resources than are available on trainers themselves.
We introduce Data PreProcessing Service (DPP), a fully disaggregated preprocessing service that scales to hundreds of nodes, eliminating data stalls that can reduce training throughput by 56%.
- Score: 5.058493679956239
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The data ingestion pipeline, responsible for storing and preprocessing
training data, is an important component of any machine learning training job.
At Facebook, we use recommendation models extensively across our services. The
data ingestion requirements to train these models are substantial. In this
paper, we present an extensive characterization of the data ingestion
challenges for industry-scale recommendation model training. First, dataset
storage requirements are massive and variable, exceeding local storage
capacities. Second, reading and preprocessing data is computationally
expensive, requiring substantially more compute, memory, and network resources
than are available on trainers themselves. These demands result in drastically
reduced training throughput, and thus wasted GPU resources, when current
on-trainer preprocessing solutions are used. To address these challenges, we
present a disaggregated data ingestion pipeline. It includes a central data
warehouse built on distributed storage nodes. We introduce Data PreProcessing
Service (DPP), a fully disaggregated preprocessing service that scales to
hundreds of nodes, eliminating data stalls that can reduce training throughput
by 56%. We implement important optimizations across storage and DPP, increasing
storage and preprocessing throughput by 1.9x and 2.3x, respectively, addressing
the substantial power requirements of data ingestion. We close with lessons
learned and cover the important remaining challenges and opportunities
surrounding data ingestion at scale.
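DPP itself is not public, but the disaggregation pattern the abstract describes, a separately scaled pool of preprocessing workers that stream training-ready batches to the trainers, can be sketched roughly as below. All names and the multiprocessing transport are hypothetical stand-ins, not Facebook's implementation.

```python
# Rough sketch of disaggregated preprocessing (hypothetical, not the paper's DPP):
# preprocessing workers run outside the trainer and stream ready batches to it,
# so the training loop does not stall on CPU-heavy feature extraction.
import multiprocessing as mp
import time

def preprocess_worker(raw_shard, batch_queue):
    """Hypothetical DPP-style worker: turns raw rows into training-ready batches."""
    for rows in raw_shard:
        batch = [x * 2 for x in rows]  # stand-in for decompression/feature transforms
        batch_queue.put(batch)
    batch_queue.put(None)  # tell the trainer this shard is finished

def trainer_loop(batch_queue, num_workers):
    """Consumes batches; in a real job this is the GPU forward/backward step."""
    done = 0
    while done < num_workers:
        batch = batch_queue.get()
        if batch is None:
            done += 1
            continue
        time.sleep(0.01)  # stand-in for a training step
        print(f"trained on a batch of {len(batch)} samples")

if __name__ == "__main__":
    shards = [[list(range(i, i + 8)) for i in range(0, 32, 8)] for _ in range(2)]
    queue = mp.Queue(maxsize=16)  # bounded buffer decouples the two tiers
    workers = [mp.Process(target=preprocess_worker, args=(s, queue)) for s in shards]
    for w in workers:
        w.start()
    trainer_loop(queue, num_workers=len(workers))
    for w in workers:
        w.join()
```

Scaling the worker pool independently of the trainers is what lets the pipeline absorb the compute, memory, and network demands the abstract describes.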
Related papers
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- RINAS: Training with Dataset Shuffling Can Be General and Fast [2.485503195398027]
RINAS is a data loading framework that addresses the performance bottleneck of loading global shuffled datasets.
We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
arXiv Detail & Related papers (2023-12-04T21:50:08Z)
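For context, the sketch below shows the conventional globally shuffled PyTorch DataLoader setup whose loading cost RINAS targets. The toy dataset and all parameters are illustrative; this is the standard baseline, not the RINAS implementation.

```python
# Baseline globally shuffled loading in PyTorch (illustrative; not RINAS itself).
import torch
from torch.utils.data import DataLoader, Dataset

class ToyTextDataset(Dataset):
    """Stand-in for a HuggingFace/TorchVision-backed dataset."""
    def __init__(self, num_samples=10_000, seq_len=16):
        self.data = torch.randint(0, 30_000, (num_samples, seq_len))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # With a global shuffle, idx jumps randomly over the whole dataset,
        # which defeats sequential I/O when samples actually live on disk.
        return self.data[idx]

if __name__ == "__main__":
    loader = DataLoader(
        ToyTextDataset(),
        batch_size=256,
        shuffle=True,    # global shuffle of sample indices every epoch
        num_workers=4,   # parallel fetch/preprocess workers
        pin_memory=True,
    )
    for batch in loader:
        pass  # training step goes here
```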
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files.
We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
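A simple way to check for the kind of hidden trade-off this paper studies is to time how long each step waits on the input pipeline versus how long it spends in the model. The helper below is a generic profiling sketch, not the paper's tooling; with CUDA, call torch.cuda.synchronize() before reading the clock.

```python
# Generic sketch for locating an input-pipeline bottleneck: accumulate time spent
# waiting on data versus time spent in the training step.
import time

def profile_epoch(loader, train_step):
    data_time = step_time = 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # stalls here if preprocessing cannot keep up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)         # forward/backward/optimizer in a real job
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    print(f"waiting on data: {data_time:.2f}s, in training step: {step_time:.2f}s")

if __name__ == "__main__":
    fake_loader = (list(range(32)) for _ in range(200))  # pretend batches
    profile_epoch(fake_loader, train_step=lambda b: time.sleep(0.001))
```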
- Importance of Data Loading Pipeline in Training Deep Neural Networks [2.127049691404299]
In large models, the time spent loading data accounts for a significant portion of model training time.
We evaluate a binary data format to accelerate data reading and NVIDIA DALI to accelerate data augmentation.
Our study shows improvements on the order of 20% to 40% when such dedicated tools are used.
arXiv Detail & Related papers (2020-04-21T14:19:48Z)
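As a rough illustration of why a packed binary format speeds up data reading (the specific format and the DALI pipeline used in that study are not reproduced here), one can write all samples into a single contiguous file and memory-map it, instead of opening many small per-sample files each epoch:

```python
# Illustrative only: packing samples into one binary file and memory-mapping it,
# versus reading thousands of small per-sample files every epoch.
import numpy as np

# One-time packing step: write the whole dataset as a contiguous float32 blob.
samples = np.random.rand(10_000, 128).astype(np.float32)  # pretend dataset
samples.tofile("train.bin")

# Training-time reading: memory-map the blob and slice out batches; reads are
# sequential and served from the OS page cache after the first pass.
data = np.memmap("train.bin", dtype=np.float32, mode="r").reshape(-1, 128)
for start in range(0, len(data), 256):
    batch = np.array(data[start:start + 256])  # materialize the batch in RAM
    # ... hand the batch to the training step ...
```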
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.