tf.data: A Machine Learning Data Processing Framework
- URL: http://arxiv.org/abs/2101.12127v2
- Date: Tue, 23 Feb 2021 22:56:12 GMT
- Title: tf.data: A Machine Learning Data Processing Framework
- Authors: Derek G. Murray, Jiri Simsa, Ana Klimovic, Ihor Indyk
- Abstract summary: Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training machine learning models requires feeding input data for models to
ingest. Input pipelines for machine learning jobs are often challenging to
implement efficiently as they require reading large volumes of data, applying
complex transformations, and transferring data to hardware accelerators while
overlapping computation and communication to achieve optimal performance. We
present tf.data, a framework for building and executing efficient input
pipelines for machine learning jobs. The tf.data API provides operators which
can be parameterized with user-defined computation, composed, and reused across
different machine learning domains. These abstractions allow users to focus on
the application logic of data processing, while tf.data's runtime ensures that
pipelines run efficiently.
We demonstrate that input pipeline performance is critical to the end-to-end
training time of state-of-the-art machine learning models. tf.data delivers the
high performance required, while avoiding the need for manual tuning of
performance knobs. We show that tf.data features, such as parallelism, caching,
static optimizations, and non-deterministic execution are essential for high
performance. Finally, we characterize machine learning input pipelines for
millions of jobs that ran in Google's fleet, showing that input data processing
is highly diverse and consumes a significant fraction of job resources. Our
analysis motivates future research directions, such as sharing computation
across jobs and pushing data projection to the storage layer.
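The abstract describes input pipelines as operators that can be parameterized with user-defined computation, composed, and reused, with the runtime handling parallelism and prefetching. The following is a minimal pure-Python sketch of that operator-composition idea; it is not the tf.data API, and the names `Pipeline`, `map_parallel`, and `batch` are illustrative only.

```python
# Minimal sketch of composable input-pipeline operators: each operator
# wraps an iterable and returns a new pipeline, so stages can be chained
# and reused. NOT the tf.data API; all names here are illustrative.
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


class Pipeline:
    def __init__(self, source):
        self._source = source

    def map_parallel(self, fn, workers=4):
        """Apply a user-defined function to every element, overlapping
        work across threads (order-preserving)."""
        def gen():
            with ThreadPoolExecutor(max_workers=workers) as pool:
                yield from pool.map(fn, self._source)
        return Pipeline(gen())

    def batch(self, size):
        """Group consecutive elements into fixed-size lists."""
        def gen():
            it = iter(self._source)
            while chunk := list(islice(it, size)):
                yield chunk
        return Pipeline(gen())

    def __iter__(self):
        return iter(self._source)


# User-defined computation is a plain function plugged into an operator;
# the same pipeline definition can be reused across different sources.
pipeline = Pipeline(range(8)).map_parallel(lambda x: x * x).batch(3)
print(list(pipeline))  # [[0, 1, 4], [9, 16, 25], [36, 49]]
```

The point of the abstraction is the separation of concerns the abstract names: the user writes only the application logic (the lambda), while the runtime layer decides how much parallelism and buffering to apply.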
Related papers
- Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of Machine Learning Models [17.091169031023714]
Data pipelines play an indispensable role in tasks such as machine learning modeling and data product development.
This paper focuses on exploring how to optimize data flow through automated machine learning methods.
We will discuss how to leverage AutoML technology to enhance the intelligence of Data Pipeline.
arXiv Detail & Related papers (2024-02-20T11:06:42Z)
- cedar: Optimized and Unified Machine Learning Input Data Pipelines [2.0375440421573843]
cedar is an optimized and unified programming framework for machine learning input data pipelines.
cedar orchestrates processing across a customizable set of local and distributed compute resources.
cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
arXiv Detail & Related papers (2024-01-17T00:36:58Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- tf.data service: A Case for Disaggregating ML Input Data Processing [4.851146762916078]
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt.
To avoid data stalls, input data must be preprocessed at the rate at which accelerators ingest it, yet the host CPU and RAM required for input data processing per accelerator core varies across jobs.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow.
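The disaggregation idea above moves preprocessing off the training host onto separately scalable workers. The following is a toy sketch of that producer/consumer split using threads and a bounded queue; it is not the tf.data service API, and `worker` and `run` are invented names for illustration.

```python
# Illustrative sketch of disaggregated input processing (NOT the real
# tf.data service): preprocessing runs on separate "worker" threads that
# feed a bounded queue, while the trainer only consumes ready batches.
import queue
import threading


def worker(shard, out_q):
    """Simulated input worker: preprocesses its shard of the data."""
    for x in shard:
        out_q.put(x * 2)  # stand-in for decode/augment work


def run(num_workers=2, items_per_worker=3):
    out_q = queue.Queue(maxsize=8)  # bounded: applies backpressure
    shards = [range(w * items_per_worker, (w + 1) * items_per_worker)
              for w in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(s, out_q))
               for s in shards]
    for t in threads:
        t.start()
    # Trainer side: consume exactly the expected number of elements.
    results = [out_q.get() for _ in range(num_workers * items_per_worker)]
    for t in threads:
        t.join()
    return sorted(results)  # interleaving across workers is non-deterministic


print(run())  # [0, 2, 4, 6, 8, 10]
```

Because the workers are independent of the consumer, their count can be scaled to match whatever CPU/RAM a given job's preprocessing demands, which is the core argument for disaggregation.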
arXiv Detail & Related papers (2022-10-26T16:15:45Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques must be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
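The survey above notes that under concept drift, a detection technique is needed to keep accuracy from silently degrading. One classic family of detectors simply monitors the model's error rate over time; the sketch below is a hedged, minimal example of that idea (the class `WindowDriftDetector` and its thresholds are invented for illustration, not taken from any of the surveyed tools).

```python
# Minimal windowed concept-drift detector: compare the recent error rate
# of a sliding window against a baseline rate and flag drift when it
# rises past a threshold. Illustrative only; real detectors (e.g. DDM,
# ADWIN) use statistical tests rather than a fixed threshold.
from collections import deque


class WindowDriftDetector:
    def __init__(self, window=50, threshold=0.2):
        self.window = deque(maxlen=window)  # recent 0/1 error outcomes
        self.reference_rate = None          # baseline from first full window
        self.threshold = threshold

    def update(self, is_error):
        """Feed one prediction outcome; return True if drift is detected."""
        self.window.append(1 if is_error else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        rate = sum(self.window) / len(self.window)
        if self.reference_rate is None:
            self.reference_rate = rate  # first full window sets the baseline
            return False
        return rate - self.reference_rate > self.threshold


detector = WindowDriftDetector(window=10, threshold=0.2)
# 20 accurate predictions, then the stream's distribution shifts.
flags = [detector.update(False) for _ in range(20)]
flags += [detector.update(True) for _ in range(10)]
print(any(flags[:20]), flags[-1])  # False True
```

On a drift signal, an adaptive pipeline would typically retrain or replace the model on recent data, which is the adaptation step the survey measures.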
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.