Taming Resource Heterogeneity In Distributed ML Training With Dynamic
Batching
- URL: http://arxiv.org/abs/2305.12213v1
- Date: Sat, 20 May 2023 15:33:06 GMT
- Title: Taming Resource Heterogeneity In Distributed ML Training With Dynamic
Batching
- Authors: Sahil Tyagi and Prateek Sharma
- Abstract summary: Current techniques for distributed model training mostly assume that clusters are composed of homogeneous servers with constant resource availability.
We develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch size on each worker based on its resource availability and throughput.
- Score: 1.047192732651018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current techniques and systems for distributed model training mostly assume
that clusters are comprised of homogeneous servers with a constant resource
availability. However, cluster heterogeneity is pervasive in computing
infrastructure, and is a fundamental characteristic of low-cost transient
resources (such as EC2 spot instances). In this paper, we develop a dynamic
batching technique for distributed data-parallel training that adjusts the
mini-batch sizes on each worker based on its resource availability and
throughput. Our mini-batch controller seeks to equalize iteration times on all
workers, and facilitates training on clusters comprised of servers with
different amounts of CPU and GPU resources. This variable mini-batch technique
uses proportional control and ideas from PID controllers to find stable
mini-batch sizes. Our empirical evaluation shows that dynamic batching can
reduce model training times by more than 4x on heterogeneous clusters.
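The variable mini-batch idea above can be made concrete with a small sketch: measure each worker's iteration time, convert it to throughput, and use proportional control to nudge each worker's mini-batch size toward a share of the global batch proportional to its throughput, so iteration times equalize. This is an illustrative sketch, not the authors' implementation; the function name rebalance_batches, the gain k_p, and the linear cost model in the example are assumptions.

# Minimal sketch (assumed, not the paper's code) of proportional control
# for per-worker mini-batch sizes in heterogeneous data-parallel training.

def rebalance_batches(batch_sizes, iter_times, global_batch, k_p=0.5, min_bs=1):
    """Return new per-worker mini-batch sizes that keep the global batch
    size fixed while moving toward equalized iteration times.

    batch_sizes : current mini-batch size of each worker
    iter_times  : measured iteration time (seconds) of each worker
    global_batch: total mini-batch size to preserve across all workers
    k_p         : proportional gain (0 < k_p <= 1); smaller is smoother
    """
    # Per-worker throughput (samples/second) from the last iteration.
    throughputs = [b / t for b, t in zip(batch_sizes, iter_times)]

    # A worker's ideal share of the global batch is proportional to its
    # throughput: faster workers receive more samples, slower ones fewer.
    total_tp = sum(throughputs)
    targets = [global_batch * tp / total_tp for tp in throughputs]

    # Proportional control: move each batch size a fraction k_p of the way
    # toward its target instead of jumping there, which keeps sizes stable.
    new_sizes = [
        max(min_bs, round(b + k_p * (target - b)))
        for b, target in zip(batch_sizes, targets)
    ]

    # Absorb rounding drift so the sizes still sum to the global batch.
    drift = global_batch - sum(new_sizes)
    new_sizes[new_sizes.index(max(new_sizes))] += drift
    return new_sizes


if __name__ == "__main__":
    # Example: worker 1 is twice as slow per sample, so its share shrinks
    # until both workers take roughly the same time per iteration.
    per_sample = [0.20 / 128, 0.40 / 128]
    sizes = [128, 128]
    for _ in range(5):
        times = [s * c for s, c in zip(sizes, per_sample)]
        print(sizes, [round(t, 3) for t in times])
        sizes = rebalance_batches(sizes, times, global_batch=256)

In this toy run the sizes converge toward roughly 171 and 85 samples, at which point both workers take about 0.27 seconds per iteration. A full controller would also need to respect memory limits and the learning-rate implications of changing batch sizes, which this sketch ignores.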
Related papers
- Equitable-FL: Federated Learning with Sparsity for Resource-Constrained
Environment [10.980548731600116]
We propose a sparse form of federated learning that performs well in resource-constrained environments.
Our goal is to make learning possible, regardless of a node's space, computing, or bandwidth scarcity.
Results obtained from experiments performed for training convolutional neural networks validate the efficacy of Equitable-FL.
arXiv Detail & Related papers (2023-09-02T08:40:17Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - PiPar: Pipeline Parallelism for Collaborative Machine Learning [16.131285496487678]
Collaborative machine learning (CML) techniques have been proposed to train deep learning models across multiple mobile devices and a server.
CML techniques are privacy-preserving because each device shares a locally trained model with the server rather than its raw data.
We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization.
arXiv Detail & Related papers (2022-12-01T20:51:47Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for
Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - Parallel Successive Learning for Dynamic Distributed Model Training over
Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z) - Efficient Device Scheduling with Multi-Job Federated Learning [64.21733164243781]
We propose a novel multi-job Federated Learning framework to enable the parallel training process of multiple jobs.
We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost.
Our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
arXiv Detail & Related papers (2021-12-11T08:05:11Z) - Doing More by Doing Less: How Structured Partial Backpropagation
Improves Deep Learning Clusters [9.17259958324486]
Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources.
We propose Structured Partial Backpropagation (SPB), a technique that controls the amount of backpropagation at individual workers in distributed training.
We find that JigSaw, an SPB-aware scheduler, can improve large-scale cluster efficiency by up to 28%.
arXiv Detail & Related papers (2021-11-20T20:34:26Z) - HeterPS: Distributed Deep Learning With Reinforcement Learning Based
Scheduling in Heterogeneous Environments [37.55572042288321]
The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
arXiv Detail & Related papers (2021-11-20T17:09:15Z) - Clairvoyant Prefetching for Distributed Machine Learning I/O [9.490118207943192]
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers.
We introduce HDMLP, a novel machine learning I/O framework, to tackle the I/O bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that delivers better performance than state-of-the-art approaches.
arXiv Detail & Related papers (2021-01-21T17:21:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.