cedar: Optimized and Unified Machine Learning Input Data Pipelines
- URL: http://arxiv.org/abs/2401.08895v3
- Date: Wed, 16 Oct 2024 17:54:15 GMT
- Title: cedar: Optimized and Unified Machine Learning Input Data Pipelines
- Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis
- Abstract summary: cedar is an optimized and unified programming framework for machine learning input data pipelines.
cedar orchestrates processing across a customizable set of local and distributed compute resources.
cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
- Score: 2.0375440421573843
- Abstract: The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources - or worse - underutilize expensive accelerators. To address these demands, we present cedar, an optimized and unified programming framework for ML input data pipelines. cedar allows users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. cedar introduces an extensible optimizer that systematically applies a complex combination of optimizations (e.g., offloading, caching, prefetching, fusion, and reordering). It orchestrates processing across a customizable set of local and distributed compute resources in order to improve processing performance and efficiency, all without user input. Across eight pipelines, cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
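As a rough illustration of two ideas named in the abstract, composable operators and operator fusion, the following Python sketch chains per-sample map operators and then collapses them into a single operator. The names `MapOp`, `fuse`, and `run` are hypothetical and chosen only for this example; they are not cedar's API, and the sketch makes no claim about how cedar's optimizer is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class MapOp:
    """Hypothetical per-sample operator: a name plus a function."""
    name: str
    fn: Callable[[Any], Any]


def fuse(ops: List[MapOp]) -> List[MapOp]:
    """Collapse a chain of map operators into one operator.

    The fused operator applies the original functions in order, so the
    pipeline produces the same samples with a single operator per sample.
    """
    if not ops:
        return []
    fns = tuple(op.fn for op in ops)

    def fused(x: Any) -> Any:
        for f in fns:
            x = f(x)
        return x

    return [MapOp("+".join(op.name for op in ops), fused)]


def run(source: Iterable[Any], ops: List[MapOp], batch_size: int) -> Iterator[List[Any]]:
    """Apply each operator to every sample, then group samples into batches."""
    batch: List[Any] = []
    for sample in source:
        for op in ops:
            sample = op.fn(sample)
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch


# Two composable operators (assumed transformations, for illustration only).
ops = [MapOp("scale", lambda x: x * 2), MapOp("shift", lambda x: x + 1)]

unfused_batches = list(run(range(5), ops, batch_size=2))
fused_ops = fuse(ops)
fused_batches = list(run(range(5), fused_ops, batch_size=2))
```

In this sketch the fused and unfused pipelines yield identical batches; fusion only changes how many operators each sample passes through, which is the kind of rewrite an optimizer can apply without user input.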
Related papers
- IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization [74.34707794886751]
This paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability.
We also propose IOPO, which takes both input and output preference pairs into consideration.
Experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO.
arXiv Detail & Related papers (2024-11-09T15:12:43Z)
- Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files.
We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z)
- tf.data service: A Case for Disaggregating ML Input Data Processing [4.851146762916078]
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt.
To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data.
arXiv Detail & Related papers (2022-10-26T16:15:45Z)
- dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training [12.413533491501548]
This paper proposes dPRO, a tool to identify performance bottlenecks in distributed training systems.
We implement dPRO on multiple deep learning frameworks (PyTorch, MXNet) and representative communication schemes (AllReduce and Parameter Server architecture).
Extensive experiments show that dPRO predicts the performance of distributed training in various settings with less than 5% error in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.
arXiv Detail & Related papers (2022-05-05T07:15:25Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- JUMBO: Scalable Multi-task Bayesian Optimization using Offline Data [86.8949732640035]
We propose JUMBO, an MBO algorithm that sidesteps limitations by querying additional data.
We show that it achieves no-regret under conditions analogous to GP-UCB.
Empirically, we demonstrate significant performance improvements over existing approaches on two real-world optimization problems.
arXiv Detail & Related papers (2021-06-02T05:03:38Z)
- tf.data: A Machine Learning Data Processing Framework [0.4588028371034406]
Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-01-28T17:16:46Z)