Related papers: HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

URL: http://arxiv.org/abs/2111.10635v4
Date: Wed, 7 Jun 2023 13:33:11 GMT
Title: HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments
Authors: Ji Liu, Zhihua Wu, Dianhai Yu, Yanjun Ma, Danlei Feng, Minxu Zhang, Xinxuan Wu, Xuefeng Yao, Dejing Dou
Abstract summary: Training process of neural networks (DNNs) generally handles large-scale input data with many sparse features. Paddle-HeterPS is composed of a distributed architecture and a Reinforcement Reinforcement (RL)-based scheduling method. We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller)
Score: 37.55572042288321
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs, GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using the heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables efficient training process of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The codes of the framework are publicly available at: https://github.com/PaddlePaddle/Paddle.

Related papers

OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters [1.4131700241686853]
We develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneous resources. Our approach is inspired by proportional controllers to balance across heterogeneous servers, and works under varying resource availability.
arXiv Detail & Related papers (2025-03-21T18:26:24Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models. This study advocates partitioning the model across GPU and generating synthetic intermediate labels to train individual segments. This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training. Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods. Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z)
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching [1.047192732651018]
Current techniques for distributed model training mostly assume that clusters are comprised of servers with a constant resource availability. We develop a dynamic technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on availability and throughput.
arXiv Detail & Related papers (2023-05-20T15:33:06Z)
Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices. We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
PiPar: Pipeline Parallelism for Collaborative Machine Learning [16.131285496487678]
Collaborative machine learning (CML) techniques have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques are privacy-preserving as a local model that is trained on each device instead of the raw data from the device is shared with the server. We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization.
arXiv Detail & Related papers (2022-12-01T20:51:47Z)
Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency. We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
Doing More by Doing Less: How Structured Partial Backpropagation Improves Deep Learning Clusters [9.17259958324486]
Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources. We propose Structured Partial Backpropagation(SPB), a technique that controls the amount of backpropagation at individual workers in distributed training. We find that JigSaw can improve large scale cluster efficiency by as high as 28%.
arXiv Detail & Related papers (2021-11-20T20:34:26Z)
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors. Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models. This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources. Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize. We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.