HeterPS: Distributed Deep Learning With Reinforcement Learning Based
Scheduling in Heterogeneous Environments
- URL: http://arxiv.org/abs/2111.10635v4
- Date: Wed, 7 Jun 2023 13:33:11 GMT
- Title: HeterPS: Distributed Deep Learning With Reinforcement Learning Based
Scheduling in Heterogeneous Environments
- Authors: Ji Liu, Zhihua Wu, Dianhai Yu, Yanjun Ma, Danlei Feng, Minxu Zhang,
Xinxuan Wu, Xuefeng Yao, Dejing Dou
- Abstract summary: The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
- Score: 37.55572042288321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) exploit many layers and a large number of
parameters to achieve excellent performance. The training process of DNN models
generally handles large-scale input data with many sparse features, which
incurs high Input/Output (IO) cost, while some layers are compute-intensive.
The training process generally exploits distributed computing resources to
reduce training time. In addition, heterogeneous computing resources, e.g.,
CPUs, GPUs of multiple types, are available for the distributed training
process. Thus, the scheduling of multiple layers to diverse computing resources
is critical for the training process. To efficiently train a DNN model using
the heterogeneous computing resources, we propose a distributed framework,
i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a
distributed architecture and a Reinforcement Learning (RL)-based scheduling
method. The advantages of Paddle-HeterPS are three-fold compared with existing
frameworks. First, Paddle-HeterPS enables the efficient training of diverse
workloads with heterogeneous computing resources. Second, Paddle-HeterPS
exploits an RL-based method to efficiently schedule the workload of each layer
to appropriate computing resources to minimize the cost while satisfying
throughput constraints. Third, Paddle-HeterPS manages data storage and data
communication among distributed computing resources. We carry out extensive
experiments to show that Paddle-HeterPS significantly outperforms
state-of-the-art approaches in terms of throughput (14.5 times higher) and
monetary cost (312.3% smaller). The code of the framework is publicly
available at: https://github.com/PaddlePaddle/Paddle.
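The scheduling problem described in the abstract (assign each layer to a resource type so that cost is minimized while a throughput floor is met) can be illustrated with a toy example. The sketch below is not the paper's RL-based scheduler: it swaps in plain exhaustive search, which is only feasible for a handful of layers. The layer names, prices, per-batch timings, and the pipelined-throughput model (throughput limited by the slowest layer) are all hypothetical assumptions made for illustration.

```python
from itertools import product

# Hypothetical model and resources; every number here is invented.
LAYERS = ["embedding", "fc1", "fc2"]
PRICE_PER_SECOND = {"cpu": 0.04, "gpu": 0.30}  # arbitrary currency units
SECONDS_PER_BATCH = {
    ("embedding", "cpu"): 0.010, ("embedding", "gpu"): 0.020,  # IO-bound layer
    ("fc1", "cpu"): 0.030, ("fc1", "gpu"): 0.005,              # compute-bound
    ("fc2", "cpu"): 0.020, ("fc2", "gpu"): 0.004,              # compute-bound
}


def evaluate(plan):
    """Return (throughput in batches/s, cost per batch) for a layer->resource plan.

    Assumes the layers run as a pipeline, so the slowest layer bounds throughput.
    """
    times = [SECONDS_PER_BATCH[(layer, res)] for layer, res in plan.items()]
    throughput = 1.0 / max(times)
    cost = sum(
        PRICE_PER_SECOND[res] * SECONDS_PER_BATCH[(layer, res)]
        for layer, res in plan.items()
    )
    return throughput, cost


def cheapest_plan(min_throughput):
    """Try every assignment; keep the cheapest one that meets the throughput floor."""
    best_plan, best_cost = None, float("inf")
    for choice in product(PRICE_PER_SECOND, repeat=len(LAYERS)):
        plan = dict(zip(LAYERS, choice))
        throughput, cost = evaluate(plan)
        if throughput >= min_throughput and cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan  # None if no assignment satisfies the constraint


if __name__ == "__main__":
    for floor in (10, 40, 60):
        print(floor, cheapest_plan(floor))
```

With these made-up numbers, raising the throughput floor moves the compute-bound layers to the GPU one at a time, while the IO-bound embedding layer stays on the CPU. The search space grows exponentially with the number of layers, which is one reason a learned scheduler is attractive for real models.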
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPU and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
- Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching [1.047192732651018]
Current techniques for distributed model training mostly assume that clusters are comprised of servers with a constant resource availability.
We develop a dynamic technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on availability and throughput.
arXiv Detail & Related papers (2023-05-20T15:33:06Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- PiPar: Pipeline Parallelism for Collaborative Machine Learning [16.131285496487678]
Collaborative machine learning (CML) techniques have been proposed to train deep learning models across multiple mobile devices and a server.
CML techniques are privacy-preserving because a local model trained on each device, rather than the device's raw data, is shared with the server.
We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization.
arXiv Detail & Related papers (2022-12-01T20:51:47Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Doing More by Doing Less: How Structured Partial Backpropagation Improves Deep Learning Clusters [9.17259958324486]
Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources.
We propose Structured Partial Backpropagation (SPB), a technique that controls the amount of backpropagation at individual workers in distributed training.
We find that JigSaw can improve large scale cluster efficiency by as high as 28%.
arXiv Detail & Related papers (2021-11-20T20:34:26Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.