Related papers: cedar: Optimized and Unified Machine Learning Input Data Pipelines

URL: http://arxiv.org/abs/2401.08895v3
Date: Wed, 16 Oct 2024 17:54:15 GMT
Title: cedar: Optimized and Unified Machine Learning Input Data Pipelines
Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis
Abstract summary: cedar is an optimized and unified programming framework for machine learning input data pipelines. cedar orchestrates processing across a customizable set of local and distributed compute resources. cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
Score: 2.0375440421573843
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources - or worse - underutilize expensive accelerators. To address these demands, we present cedar, an optimized and unified programming framework for ML input data pipelines. cedar allows users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. cedar introduces an extensible optimizer that systematically applies a complex combination of optimizations (e.g., offloading, caching, prefetching, fusion, and reordering). It orchestrates processing across a customizable set of local and distributed compute resources in order to improve processing performance and efficiency, all without user input. Across eight pipelines, cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.

Related papers

OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training [17.215899004049778]
We present OVERLORD, an industrial-grade distributed data loading architecture with three innovations. OVERLORD achieves: (1) 4.5x end-to-end training throughput improvement; (2) a minimum 3.6x reduction in CPU memory usage.
arXiv Detail & Related papers (2025-04-14T03:31:22Z)
Prior-Fitted Networks Scale to Larger Datasets When Treated as Weak Learners [82.72552644267724]
BoostPFN can outperform standard PFNs given the same number of training samples on large datasets. High performance is maintained for up to 50x the pre-training size of PFNs.
arXiv Detail & Related papers (2025-03-03T07:31:40Z)
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization [74.34707794886751]
This paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability. We also propose IOPO, which takes both input and output preference pairs into consideration. Experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO.
arXiv Detail & Related papers (2024-11-09T15:12:43Z)
Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data. The training process of Large Language Models (LLMs) generally incurs the update of significant parameters. This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks [8.977436072381973]
We run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files. We identify the potential causes, exercise a variety of optimization methods, and present their pros and cons.
arXiv Detail & Related papers (2023-04-18T11:57:38Z)
tf.data service: A Case for Disaggregating ML Input Data Processing [4.851146762916078]
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data.
arXiv Detail & Related papers (2022-10-26T16:15:45Z)
dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training [12.413533491501548]
This paper proposes dPRO, a tool to identify performance bottlenecks in distributed training systems. We implement dPRO on multiple deep learning frameworks (PyTorch, MXNet, AllReduce and Server architecture) and representative communication schemes. Extensive experiments show that dPRO predicts performance of distributed training in various settings with <5% errors in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.
arXiv Detail & Related papers (2022-05-05T07:15:25Z)
Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines. We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
JUMBO: Scalable Multi-task Bayesian Optimization using Offline Data [86.8949732640035]
We propose JUMBO, an MBO algorithm that sidesteps limitations by querying additional data. We show that it achieves no-regret under conditions analogous to GP-UCB. Empirically, we demonstrate significant performance improvements over existing approaches on two real-world optimization problems.
arXiv Detail & Related papers (2021-06-02T05:03:38Z)
tf.data: A Machine Learning Data Processing Framework [0.4588028371034406]
Training machine learning models requires feeding input data for models to ingest. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-01-28T17:16:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.