Clairvoyant Prefetching for Distributed Machine Learning I/O
- URL: http://arxiv.org/abs/2101.08734v1
- Date: Thu, 21 Jan 2021 17:21:42 GMT
- Title: Clairvoyant Prefetching for Distributed Machine Learning I/O
- Authors: Roman Böhringer, Nikoli Dryden, Tal Ben-Nun, Torsten Hoefler
- Abstract summary: I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers.
We produce a novel machine learning I/O middleware, HDMLP, to tackle the I/O bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that delivers better performance than state-of-the-art approaches.
- Score: 9.490118207943192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: I/O is emerging as a major bottleneck for machine learning training,
especially in distributed environments such as clouds and supercomputers.
Optimal data ingestion pipelines differ between systems, and increasing
efficiency requires a delicate balance between access to local storage,
external filesystems, and remote workers; yet existing frameworks fail to
efficiently utilize such resources. We observe that, given the seed generating
the random access pattern for training with SGD, we have clairvoyance and can
exactly predict when a given sample will be accessed. We combine this with a
theoretical analysis of access patterns in training and performance modeling to
produce a novel machine learning I/O middleware, HDMLP, to tackle the I/O
bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that
delivers better performance than state-of-the-art approaches while requiring
very few changes to existing codebases and supporting a broad range of
environments.
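The seeded-shuffle observation is concrete enough to sketch. Below is a minimal Python illustration of the general idea, not the HDMLP implementation or its API: a trainer and a prefetcher that share the shuffle seed reconstruct the identical per-epoch sample order, so the prefetcher knows exactly when each sample will be accessed. The per-epoch reseeding scheme (`seed + epoch`), the function names, and the `fetch` callback are assumptions made for illustration.

```python
import random

def access_order(seed: int, num_samples: int, epoch: int) -> list[int]:
    """Reproduce the exact shuffled order the training loop will use.

    Assumes the trainer derives its epoch shuffle from `seed + epoch`;
    any deterministic scheme shared by trainer and prefetcher works.
    """
    rng = random.Random(seed + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

def prefetch_ahead(seed, num_samples, epoch, step, lookahead, fetch):
    """Fetch the samples needed in the next `lookahead` training steps.

    `fetch` is a hypothetical callback that stages one sample, e.g. from
    a parallel filesystem or a remote worker into a local cache.
    """
    order = access_order(seed, num_samples, epoch)
    for i in range(step, min(step + lookahead, num_samples)):
        fetch(order[i])

if __name__ == "__main__":
    # Trainer and prefetcher share seed 42, so both know the epoch-0 order.
    print("epoch-0 order:", access_order(seed=42, num_samples=8, epoch=0))
    prefetch_ahead(42, 8, epoch=0, step=0, lookahead=3,
                   fetch=lambda idx: print("prefetching sample", idx))
```

Because the access pattern is fully determined by the seed, the prefetcher never mispredicts; the remaining question, which the paper addresses with performance modeling, is where to stage each sample among local storage, external filesystems, and remote workers.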
Related papers
- TDML -- A Trustworthy Distributed Machine Learning Framework [7.302091381583343]
The rapid advancement of large models (LMs) has intensified the demand for computing resources.
This demand is exacerbated by limited availability due to supply chain delays and monopolistic acquisition by major tech firms.
We propose a trustworthy distributed machine learning (TDML) framework that leverages guidance to coordinate remote trainers and validate workloads.
arXiv Detail & Related papers (2024-07-10T03:22:28Z)
- REFT: Resource-Efficient Federated Training Framework for Heterogeneous and Resource-Constrained Environments [2.117841684082203]
Federated Learning (FL) plays a critical role in distributed systems and has emerged as a privacy-preserving sub-domain of machine learning.
We propose REFT, a resource-efficient federated training framework for heterogeneous and resource-constrained environments.
arXiv Detail & Related papers (2023-08-25T20:33:30Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Efficient Device Scheduling with Multi-Job Federated Learning [64.21733164243781]
We propose a novel multi-job Federated Learning framework that enables parallel training of multiple jobs.
We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost.
Our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
arXiv Detail & Related papers (2021-12-11T08:05:11Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques must be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model-training mechanism for building a distributed control and aggregation methodology across regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z)
- Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box features, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.