cedar: Composable and Optimized Machine Learning Input Data Pipelines
- URL: http://arxiv.org/abs/2401.08895v2
- Date: Thu, 25 Jan 2024 06:04:30 GMT
- Title: cedar: Composable and Optimized Machine Learning Input Data Pipelines
- Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis
- Abstract summary: cedar is a programming model and framework that allows users to easily build, optimize, and execute input data pipelines.
cedar orchestrates processing across a customizable set of local and distributed compute resources.
cedar achieves a 2.49x, 1.87x, 2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray Data, and PyTorch's DataLoader, respectively.
- Score: 2.2899953111727718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The input data pipeline is an essential component of each machine learning
(ML) training job. It is responsible for reading massive amounts of training
data, processing batches of samples using complex transformations, and loading
them onto training nodes at low latency and high throughput. Performant input
data systems are becoming increasingly critical, driven by skyrocketing data
volumes and training throughput demands. Unfortunately, current input data
systems cannot fully leverage key performance optimizations, resulting in
hugely inefficient infrastructures that require significant resources -- or
worse -- underutilize expensive accelerators.
To address these demands, we present cedar, a programming model and framework
that allows users to easily build, optimize, and execute input data pipelines.
cedar presents an easy-to-use programming interface, allowing users to define
input data pipelines using composable operators that support arbitrary ML
frameworks and libraries. Meanwhile, cedar transparently applies a complex and
extensible set of optimization techniques (e.g., offloading, caching,
prefetching, fusion, and reordering). It then orchestrates processing across a
customizable set of local and distributed compute resources in order to
maximize processing performance and efficiency, all without user input. On
average across six diverse input data pipelines, cedar achieves a 2.49x, 1.87x,
2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray
Data, and PyTorch's DataLoader, respectively.
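As a rough illustration of what "composable operators" and operator "fusion" can mean, here is a minimal Python sketch. It is not cedar's API: the `Pipeline` class and its `map` and `fuse` methods are hypothetical names invented for this example, and it covers only in-process fusion, not offloading, caching, prefetching, or reordering.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class Pipeline:
    """Hypothetical composable pipeline (illustrative only, not cedar's API)."""

    def __init__(self, source: Iterable[Any],
                 ops: Optional[List[Callable[[Any], Any]]] = None) -> None:
        self.source = source
        self.ops: List[Callable[[Any], Any]] = list(ops or [])

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        # Composition: return a new pipeline with one more per-sample operator.
        return Pipeline(self.source, self.ops + [fn])

    def fuse(self) -> "Pipeline":
        # Fusion: collapse the chain of operators into one callable so the
        # executor dispatches a single operator per sample.
        ops = list(self.ops)
        if not ops:
            return Pipeline(self.source, [])

        def fused(sample: Any) -> Any:
            for fn in ops:
                sample = fn(sample)
            return sample

        return Pipeline(self.source, [fused])

    def __iter__(self) -> Iterator[Any]:
        # Apply every operator, in order, to each sample from the source.
        for sample in self.source:
            for fn in self.ops:
                sample = fn(sample)
            yield sample


# Build a pipeline from two composable operators, then fuse them.
pipe = Pipeline(range(4)).map(lambda x: x + 1).map(lambda x: x * 2)
fused_pipe = pipe.fuse()

unfused_result = list(pipe)
fused_result = list(fused_pipe)
```

In this sketch the fused pipeline yields the same samples as the unfused one while holding a single operator. That equivalence is the property an optimizer has to preserve when it rewrites a pipeline without user input.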
Related papers
- HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator [47.66463010685586]
We propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization.
We achieve an efficiency improvement ranging from 1.3$\times$ to 4.2$\times$ compared to existing sparse designs.
arXiv Detail & Related papers (2024-06-05T09:25:18Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z) - Understand Data Preprocessing for Effective End-to-End Training of Deep
Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files.
We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z) - tf.data service: A Case for Disaggregating ML Input Data Processing [4.851146762916078]
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt.
To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data.
arXiv Detail & Related papers (2022-10-26T16:15:45Z) - Scalable Neural Data Server: A Data Recommender for Transfer Learning [70.06289658553675]
Transfer learning is a popular strategy for leveraging additional data to improve the downstream performance.
Neural Data Server (NDS), a search engine that recommends relevant data for a given downstream task, has been previously proposed to address this problem.
NDS uses a mixture of experts trained on data sources to estimate similarity between each source and the downstream task.
SNDS represents both data sources and downstream tasks by their proximity to the intermediary datasets.
arXiv Detail & Related papers (2022-06-19T12:07:32Z) - Pushing the Limits of Simple Pipelines for Few-Shot Learning: External
Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Understanding and Co-designing the Data Ingestion Pipeline for
Industry-Scale RecSys Training [5.058493679956239]
We present an extensive characterization of the data ingestion challenges for industry-scale recommendation model training.
First, dataset storage requirements are massive and variable, exceeding local storage capacities.
Second, reading and preprocessing data is computationally expensive, requiring substantially more compute, memory, and network resources than are available on trainers themselves.
We introduce Data PreProcessing Service (DPP), a fully disaggregated preprocessing service that scales to hundreds of nodes, eliminating data stalls that can reduce training throughput by 56%.
arXiv Detail & Related papers (2021-08-20T21:09:34Z) - tf.data: A Machine Learning Data Processing Framework [0.4588028371034406]
Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-01-28T17:16:46Z)