tf.data service: A Case for Disaggregating ML Input Data Processing
- URL: http://arxiv.org/abs/2210.14826v3
- Date: Tue, 2 Jan 2024 15:54:24 GMT
- Title: tf.data service: A Case for Disaggregating ML Input Data Processing
- Authors: Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa and
Chandramohan A. Thekkath
- Abstract summary: Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt.
To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core varies across jobs.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) computations commonly execute on expensive specialized
hardware, such as GPUs and TPUs, which provide high FLOPs and
performance-per-watt. For cost efficiency, it is essential to keep these
accelerators highly utilized. This requires preprocessing input data at the
rate at which the accelerators can ingest and perform ML computations on the
data. The host CPU and RAM required per accelerator core to avoid data
stalls varies across jobs.
Hence, the traditional approach of processing input data on ML accelerator
hosts with a fixed hardware ratio leads to either under-utilizing the
accelerators or the host CPU and RAM. In this paper, we address these concerns
by building a disaggregated ML data processing system.
We present tf.data service, an open-source disaggregated input data
processing service built on top of tf.data in TensorFlow. We show that
disaggregating data preprocessing has three key advantages for large-scale ML
training jobs. First, the service can horizontally scale out to right-size
CPU/RAM host resources for data processing in each job, reducing training
time by 32x and cost by 26x, on average. Second, the service can share ephemeral
preprocessed data results across jobs, to optimize CPU usage and reduce
redundant computations. Finally, the service supports coordinated reads, a
technique that avoids stragglers due to different input sizes in distributed
training, reducing training time by 2.2x, on average. Our design is inspired by
lessons learned from deploying tf.data service in production, including
relaxing data visitation guarantees without impacting model accuracy.
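The service's public entry point in TensorFlow is `tf.data.experimental.service`. As a minimal sketch of the disaggregated design, a pipeline is built as usual and then handed off to the service with `distribute`. Everything runs in-process here for illustration; in a real deployment the dispatcher and workers run on separate CPU hosts, decoupled from the accelerator hosts.

```python
import tensorflow as tf

# In-process dispatcher and worker for illustration only; in production
# these run as separate processes on dedicated CPU hosts.
dispatcher = tf.data.experimental.service.DispatchServer()
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

# An ordinary tf.data pipeline; after distribute(), the map() executes on
# the service's workers rather than on the trainer host.
dataset = tf.data.Dataset.range(10).map(lambda x: x * 2)
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs", service=dispatcher.target))

# Elements may arrive out of order, so sort before inspecting.
result = sorted(int(x) for x in dataset.as_numpy_iterator())
print(result)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the trainer only pulls preprocessed elements over the network, the CPU/RAM devoted to preprocessing can be scaled independently of the number of accelerator hosts.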
Related papers
- cedar: Composable and Optimized Machine Learning Input Data Pipelines
cedar is a programming model and framework that allows users to easily build, optimize, and execute input data pipelines.
cedar orchestrates processing across a customizable set of local and distributed compute resources.
cedar achieves a 2.49x, 1.87x, 2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray Data, and PyTorch's DataLoader, respectively.
arXiv: 2024-01-17
- Efficient Asynchronous Federated Learning with Sparsification and Quantization
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv: 2023-12-23
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv: 2023-06-12
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv: 2022-10-17
- NumS: Scalable Array Programming for the Cloud
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv: 2022-06-28
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv: 2022-02-17
- Understanding and Co-designing the Data Ingestion Pipeline for Industry-Scale RecSys Training
We present an extensive characterization of the data ingestion challenges for industry-scale recommendation model training.
First, dataset storage requirements are massive and variable, exceeding local storage capacities.
Second, reading and preprocessing data is computationally expensive, requiring substantially more compute, memory, and network resources than are available on the trainers themselves.
We introduce Data PreProcessing Service (DPP), a fully disaggregated preprocessing service that scales to hundreds of nodes, eliminating data stalls that can reduce training throughput by 56%.
arXiv: 2021-08-20
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv: 2021-05-25
- tf.data: A Machine Learning Data Processing Framework
Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
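A hypothetical minimal pipeline in this vein (the dataset and transformations are illustrative, not taken from the paper) chains the map/shuffle/batch/prefetch operators that tf.data provides, letting preprocessing overlap with the consumer's work:

```python
import tensorflow as tf

# Illustrative pipeline: transform in parallel, shuffle, batch, and
# prefetch so preprocessing overlaps with the training step that consumes it.
dataset = (tf.data.Dataset.range(100)
           .map(lambda x: x * x, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=100)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

num_batches = sum(1 for _ in dataset)
print(num_batches)  # 4 batches: three of 32 elements and one of 4
```

tf.data.AUTOTUNE lets the runtime pick parallelism and buffer sizes dynamically rather than requiring hand-tuning per job.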
arXiv: 2021-01-28
- Importance of Data Loading Pipeline in Training Deep Neural Networks
In large models, the time spent loading data takes a significant portion of model training time.
We evaluate binary data formats to accelerate data reading and NVIDIA DALI to accelerate data augmentation.
Our study shows improvement on the order of 20% to 40% if such dedicated tools are used.
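As one concrete instance of a binary format in the TensorFlow ecosystem (chosen here for illustration; the paper's own setup may differ), TFRecord files store serialized tf.train.Example protos that a pipeline can read back without repeatedly parsing a text format:

```python
import os
import tempfile
import tensorflow as tf

# Write a handful of tf.train.Example records to a TFRecord file.
path = os.path.join(tempfile.mkdtemp(), "data.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(5):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))}))
        writer.write(example.SerializeToString())

# Read the records back through tf.data, parsing each serialized proto.
spec = {"x": tf.io.FixedLenFeature([], tf.int64)}
dataset = tf.data.TFRecordDataset(path).map(
    lambda record: tf.io.parse_single_example(record, spec)["x"])
values = [int(v) for v in dataset.as_numpy_iterator()]
print(values)  # [0, 1, 2, 3, 4]
```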
arXiv: 2020-04-21
- Quantifying the Performance of Federated Transfer Learning
Federated Transfer Learning (FTL) is a solution to share data without violating data privacy.
FTL uses transfer learning techniques to utilize data from different sources for training.
Our paper answers this question by quantitatively measuring a real-world FTL implementation on Google Cloud.
arXiv: 2019-12-30
This list is automatically generated from the titles and abstracts of the papers on this site.