HeterPS: Distributed Deep Learning With Reinforcement Learning Based
  Scheduling in Heterogeneous Environments
- URL: http://arxiv.org/abs/2111.10635v4
- Date: Wed, 7 Jun 2023 13:33:11 GMT
- Title: HeterPS: Distributed Deep Learning With Reinforcement Learning Based
  Scheduling in Heterogeneous Environments
- Authors: Ji Liu, Zhihua Wu, Dianhai Yu, Yanjun Ma, Danlei Feng, Minxu Zhang,
  Xinxuan Wu, Xuefeng Yao, Dejing Dou
- Abstract summary: The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
- Score: 37.55572042288321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Deep neural networks (DNNs) exploit many layers and a large number of
parameters to achieve excellent performance. The training process of DNN models
generally handles large-scale input data with many sparse features, which
incurs high Input/Output (IO) cost, while some layers are compute-intensive.
The training process generally exploits distributed computing resources to
reduce training time. In addition, heterogeneous computing resources, e.g.,
CPUs, GPUs of multiple types, are available for the distributed training
process. Thus, the scheduling of multiple layers to diverse computing resources
is critical for the training process. To efficiently train a DNN model using
the heterogeneous computing resources, we propose a distributed framework,
i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a
distributed architecture and a Reinforcement Learning (RL)-based scheduling
method. The advantages of Paddle-HeterPS are three-fold compared with existing
frameworks. First, Paddle-HeterPS enables the efficient training of diverse
workloads with heterogeneous computing resources. Second, Paddle-HeterPS
exploits an RL-based method to efficiently schedule the workload of each layer
to appropriate computing resources to minimize the cost while satisfying
throughput constraints. Third, Paddle-HeterPS manages data storage and data
communication among distributed computing resources. We carry out extensive
experiments to show that Paddle-HeterPS significantly outperforms
state-of-the-art approaches in terms of throughput (14.5 times higher) and
monetary cost (312.3% smaller). The code of the framework is publicly
available at: https://github.com/PaddlePaddle/Paddle.
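
The scheduling problem stated in the abstract (assign each layer to a resource type so that monetary cost is minimized while a throughput constraint holds) can be illustrated with a toy example. The sketch below is not the paper's RL-based method and does not use any Paddle API: it replaces the RL scheduler with an exhaustive search, and every layer name, timing, and price in it is a made-up assumption chosen only to make the trade-off visible.

```python
from itertools import product

# Hypothetical profile: seconds per sample for each layer on each resource type.
LAYER_TIME = {
    "embedding": {"cpu": 0.002, "gpu": 0.004},  # IO-heavy, sparse lookups
    "dense": {"cpu": 0.020, "gpu": 0.001},      # compute-intensive
}
# Hypothetical prices in dollars per second for one device of each type.
PRICE = {"cpu": 0.0001, "gpu": 0.0010}


def evaluate(plan):
    """Return (throughput in samples/s, cost in dollars per sample) for a plan."""
    # Layers run as pipeline stages, so the slowest stage bounds throughput.
    bottleneck = max(LAYER_TIME[layer][res] for layer, res in plan.items())
    throughput = 1.0 / bottleneck
    # Every assigned device is billed for the whole run.
    dollars_per_second = sum(PRICE[res] for res in plan.values())
    return throughput, dollars_per_second / throughput


def cheapest_plan(min_throughput):
    """Search all layer-to-resource plans; return (plan, cost) or None."""
    layers = list(LAYER_TIME)
    best = None
    for choice in product(PRICE, repeat=len(layers)):
        plan = dict(zip(layers, choice))
        throughput, cost = evaluate(plan)
        if throughput >= min_throughput and (best is None or cost < best[1]):
            best = (plan, cost)
    return best


if __name__ == "__main__":
    print(cheapest_plan(min_throughput=200))
```

With these assumed numbers, the cheapest feasible plan keeps the sparse embedding layer on CPU and moves the dense layer to GPU. Exhaustive search grows exponentially with the number of layers, which is one reason a learned scheduler, such as the RL-based one the abstract describes, may be preferable at realistic scale.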
 
      
Related papers
- OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters [1.4131700241686853]
 We develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneous resources.
Our approach is inspired by proportional controllers to balance across heterogeneous servers, and works under varying resource availability.
 arXiv  Detail & Related papers  (2025-03-21T18:26:24Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
 Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
 arXiv  Detail & Related papers  (2025-02-19T14:58:48Z)
- TensorSocket: Shared Data Loading for Deep Learning Training [0.0]
 Deep learning training is a repetitive and resource-intensive process.
In this paper, we present TensorSocket to reduce the computational needs of training by enabling simultaneous training processes to share the same data loader.
Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and when utilizing cloud instances, achieves cost savings of 50%.
 arXiv  Detail & Related papers  (2024-09-27T13:39:47Z)
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
 GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
 arXiv  Detail & Related papers  (2024-03-17T13:06:29Z)
- Exploring the Impact of Serverless Computing on Peer To Peer Training
  Machine Learning [0.3441021278275805]
 We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant enhancement in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
 arXiv  Detail & Related papers  (2023-09-25T13:51:07Z)
- Taming Resource Heterogeneity In Distributed ML Training With Dynamic
  Batching [1.047192732651018]
 Current techniques for distributed model training mostly assume that clusters are comprised of servers with a constant resource availability.
We develop a dynamic technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on availability and throughput.
 arXiv  Detail & Related papers  (2023-05-20T15:33:06Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and
  Graph Neural Networks [58.720142291102135]
 Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
 arXiv  Detail & Related papers  (2023-01-31T17:41:07Z)
- PiPar: Pipeline Parallelism for Collaborative Machine Learning [16.131285496487678]
 Collaborative machine learning (CML) techniques have been proposed to train deep learning models across multiple mobile devices and a server.
CML techniques are privacy-preserving because a local model trained on each device, rather than the device's raw data, is shared with the server.
We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization.
 arXiv  Detail & Related papers  (2022-12-01T20:51:47Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
 Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
 arXiv  Detail & Related papers  (2022-11-19T15:44:08Z)
- Doing More by Doing Less: How Structured Partial Backpropagation
  Improves Deep Learning Clusters [9.17259958324486]
 Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources.
We propose Structured Partial Backpropagation (SPB), a technique that controls the amount of backpropagation at individual workers in distributed training.
We find that JigSaw can improve large scale cluster efficiency by as high as 28%.
 arXiv  Detail & Related papers  (2021-11-20T20:34:26Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep
  Learning [61.29990368322931]
 Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
 arXiv  Detail & Related papers  (2020-08-27T16:56:48Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
  Partitioned Edge Learning [73.82875010696849]
 Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
 arXiv  Detail & Related papers  (2020-03-10T05:52:15Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local
  Representation Alignment [84.57874289554839]
 Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
 arXiv  Detail & Related papers  (2020-02-10T16:20:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     