tf.data: A Machine Learning Data Processing Framework
- URL: http://arxiv.org/abs/2101.12127v2
- Date: Tue, 23 Feb 2021 22:56:12 GMT
- Title: tf.data: A Machine Learning Data Processing Framework
- Authors: Derek G. Murray, Jiri Simsa, Ana Klimovic, Ihor Indyk
- Abstract summary: Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training machine learning models requires feeding input data for models to
ingest. Input pipelines for machine learning jobs are often challenging to
implement efficiently as they require reading large volumes of data, applying
complex transformations, and transferring data to hardware accelerators while
overlapping computation and communication to achieve optimal performance. We
present tf.data, a framework for building and executing efficient input
pipelines for machine learning jobs. The tf.data API provides operators which
can be parameterized with user-defined computation, composed, and reused across
different machine learning domains. These abstractions allow users to focus on
the application logic of data processing, while tf.data's runtime ensures that
pipelines run efficiently.
We demonstrate that input pipeline performance is critical to the end-to-end
training time of state-of-the-art machine learning models. tf.data delivers the
high performance required, while avoiding the need for manual tuning of
performance knobs. We show that tf.data features, such as parallelism, caching,
static optimizations, and non-deterministic execution are essential for high
performance. Finally, we characterize machine learning input pipelines for
millions of jobs that ran in Google's fleet, showing that input data processing
is highly diverse and consumes a significant fraction of job resources. Our
analysis motivates future research directions, such as sharing computation
across jobs and pushing data projection to the storage layer.
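The abstract describes input pipelines as operators that can be parameterized with user-defined computation, composed, and reused, with the runtime handling parallelism and prefetching. The following is a minimal pure-Python sketch of that operator-composition idea; it is not the tf.data API, and the names `Pipeline`, `map_parallel`, and `batch` are illustrative only.

```python
# Minimal sketch of composable input-pipeline operators: each operator
# wraps an iterable and returns a new pipeline, so stages can be chained
# and reused. NOT the tf.data API; all names here are illustrative.
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


class Pipeline:
    def __init__(self, source):
        self._source = source

    def map_parallel(self, fn, workers=4):
        """Apply a user-defined function to every element, overlapping
        work across threads (order-preserving)."""
        def gen():
            with ThreadPoolExecutor(max_workers=workers) as pool:
                yield from pool.map(fn, self._source)
        return Pipeline(gen())

    def batch(self, size):
        """Group consecutive elements into fixed-size lists."""
        def gen():
            it = iter(self._source)
            while chunk := list(islice(it, size)):
                yield chunk
        return Pipeline(gen())

    def __iter__(self):
        return iter(self._source)


# User-defined computation is a plain function plugged into an operator;
# the same pipeline definition can be reused across different sources.
pipeline = Pipeline(range(8)).map_parallel(lambda x: x * x).batch(3)
print(list(pipeline))  # [[0, 1, 4], [9, 16, 25], [36, 49]]
```

The point of the abstraction is the separation of concerns the abstract names: the user writes only the application logic (the lambda), while the runtime layer decides how much parallelism and buffering to apply.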
Related papers
- Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of Machine Learning Models [17.091169031023714]
Data pipelines play an indispensable role in tasks such as machine learning modeling and data product development.
This paper focuses on exploring how to optimize data flow through automated machine learning methods.
We will discuss how to leverage AutoML technology to enhance the intelligence of Data Pipeline.
arXiv Detail & Related papers (2024-02-20T11:06:42Z)
- cedar: Optimized and Unified Machine Learning Input Data Pipelines [2.0375440421573843]
cedar is an optimized and unified programming framework for machine learning input data pipelines.
cedar orchestrates processing across a customizable set of local and distributed compute resources.
cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
arXiv Detail & Related papers (2024-01-17T00:36:58Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- tf.data service: A Case for Disaggregating ML Input Data Processing [4.851146762916078]
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt.
To avoid data stalls, input data must be preprocessed at the rate at which accelerators ingest it, yet the host CPU and RAM required for input data processing per accelerator core varies across jobs.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow.
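The disaggregation idea above moves preprocessing off the training host onto separately scalable workers. The following is a toy sketch of that producer/consumer split using threads and a bounded queue; it is not the tf.data service API, and `worker` and `run` are invented names for illustration.

```python
# Illustrative sketch of disaggregated input processing (NOT the real
# tf.data service): preprocessing runs on separate "worker" threads that
# feed a bounded queue, while the trainer only consumes ready batches.
import queue
import threading


def worker(shard, out_q):
    """Simulated input worker: preprocesses its shard of the data."""
    for x in shard:
        out_q.put(x * 2)  # stand-in for decode/augment work


def run(num_workers=2, items_per_worker=3):
    out_q = queue.Queue(maxsize=8)  # bounded: applies backpressure
    shards = [range(w * items_per_worker, (w + 1) * items_per_worker)
              for w in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(s, out_q))
               for s in shards]
    for t in threads:
        t.start()
    # Trainer side: consume exactly the expected number of elements.
    results = [out_q.get() for _ in range(num_workers * items_per_worker)]
    for t in threads:
        t.join()
    return sorted(results)  # interleaving across workers is non-deterministic


print(run())  # [0, 2, 4, 6, 8, 10]
```

Because the workers are independent of the consumer, their count can be scaled to match whatever CPU/RAM a given job's preprocessing demands, which is the core argument for disaggregation.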
arXiv Detail & Related papers (2022-10-26T16:15:45Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques must be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
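The survey above notes that under concept drift, a detection technique is needed to keep accuracy from silently degrading. One classic family of detectors simply monitors the model's error rate over time; the sketch below is a hedged, minimal example of that idea (the class `WindowDriftDetector` and its thresholds are invented for illustration, not taken from any of the surveyed tools).

```python
# Minimal windowed concept-drift detector: compare the recent error rate
# of a sliding window against a baseline rate and flag drift when it
# rises past a threshold. Illustrative only; real detectors (e.g. DDM,
# ADWIN) use statistical tests rather than a fixed threshold.
from collections import deque


class WindowDriftDetector:
    def __init__(self, window=50, threshold=0.2):
        self.window = deque(maxlen=window)  # recent 0/1 error outcomes
        self.reference_rate = None          # baseline from first full window
        self.threshold = threshold

    def update(self, is_error):
        """Feed one prediction outcome; return True if drift is detected."""
        self.window.append(1 if is_error else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        rate = sum(self.window) / len(self.window)
        if self.reference_rate is None:
            self.reference_rate = rate  # first full window sets the baseline
            return False
        return rate - self.reference_rate > self.threshold


detector = WindowDriftDetector(window=10, threshold=0.2)
# 20 accurate predictions, then the stream's distribution shifts.
flags = [detector.update(False) for _ in range(20)]
flags += [detector.update(True) for _ in range(10)]
print(any(flags[:20]), flags[-1])  # False True
```

On a drift signal, an adaptive pipeline would typically retrain or replace the model on recent data, which is the adaptation step the survey measures.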
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.