Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters
- URL: http://arxiv.org/abs/2108.03645v1
- Date: Sun, 8 Aug 2021 14:20:05 GMT
- Title: Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters
- Authors: Zhengda Bian and Shenggui Li and Wei Wang and Yang You
- Abstract summary: We propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
- Score: 10.395955671683245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient GPU resource scheduling is essential to maximize resource
utilization and save training costs for the increasing amount of deep learning
workloads in shared GPU clusters. Existing GPU schedulers largely rely on
static policies to leverage the performance characteristics of deep learning
jobs. However, they can hardly reach optimal efficiency due to the lack of
elasticity. To address the problem, we propose ONES, an ONline Evolutionary
Scheduler for elastic batch size orchestration. ONES automatically manages the
elasticity of each job based on the training batch size, so as to maximize GPU
utilization and improve scheduling efficiency. It determines the batch size for
each job through an online evolutionary search that can continuously optimize
the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on
TACC's Longhorn supercomputers. The results show that ONES can outperform the
prior deep learning schedulers with a significantly shorter average job
completion time.
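The abstract describes the evolutionary search only at a high level. As a rough illustration, a minimal sketch of an online evolutionary batch-size search might look like the following; the names (Job, measure_throughput, evolve_step), the throughput proxy, and the halve/keep/double mutation scheme are assumptions for illustration, not ONES's actual design.

```python
# Minimal sketch of an online evolutionary batch-size search (illustrative only).
import random
from dataclasses import dataclass

CANDIDATES_PER_JOB = 4   # candidate batch sizes tried per job per round (assumption)

@dataclass
class Job:
    name: str
    batch_size: int      # current global training batch size
    min_bs: int          # smallest batch size the job supports
    max_bs: int          # largest batch size its memory budget allows

def measure_throughput(job: Job) -> float:
    # Placeholder fitness: diminishing returns as the batch size grows.
    # A real scheduler would instead use the observed samples/sec
    # (and/or statistical efficiency) of the running job.
    return job.batch_size / (1.0 + job.batch_size / 512.0)

def mutate(bs: int, lo: int, hi: int) -> int:
    # Halve, keep, or double the batch size, clipped to the valid range.
    factor = random.choice((0.5, 1.0, 2.0))
    return int(min(hi, max(lo, bs * factor)))

def evolve_step(jobs: list[Job]) -> None:
    # One online round: propose mutated batch sizes for every job and
    # keep the candidate with the best measured fitness.
    for job in jobs:
        candidates = {job.batch_size} | {
            mutate(job.batch_size, job.min_bs, job.max_bs)
            for _ in range(CANDIDATES_PER_JOB - 1)
        }
        scored = []
        for bs in candidates:
            job.batch_size = bs
            scored.append((measure_throughput(job), bs))
        job.batch_size = max(scored)[1]   # keep the fittest batch size

jobs = [Job("resnet50", 256, 32, 4096), Job("bert-base", 64, 8, 1024)]
for _ in range(10):                       # in an online scheduler this loop never stops
    evolve_step(jobs)
print({j.name: j.batch_size for j in jobs})
```

Keeping the incumbent batch size in each candidate set ensures a round never regresses on the measured fitness, which is one simple way such a continuous online search can be made monotone.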
Related papers
- FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - RESPECT: Reinforcement Learning based Edge Scheduling on Pipelined Coral
Edge TPUs [12.952987240366781]
This work presents a reinforcement learning (RL) based scheduling framework, which learns the behaviors of optimal optimization algorithms.
RL generates near-optimal scheduling results with short solving runtime overhead.
Our framework has demonstrated up to $\sim 2.5\times$ real-world on-chip runtime inference speedups over the commercial compiler.
arXiv Detail & Related papers (2023-04-10T17:22:12Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler
for Neural Networks [51.71682428015139]
We propose HARL, a reinforcement learning-based auto-scheduler for efficient tensor program exploration.
HARL improves the tensor operator performance by 22% and the search speed by 4.3x compared to the state-of-the-art auto-scheduler.
Inference performance and search speed are also significantly improved on end-to-end neural networks.
arXiv Detail & Related papers (2022-11-21T04:15:27Z) - Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch
Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time; a common form of this learning-rate adjustment is sketched after this list.
arXiv Detail & Related papers (2022-11-20T21:48:25Z) - Efficient Strong Scaling Through Burst Parallel Training [13.656104138147967]
Using large GPU clusters to train deep neural network (DNN) models is becoming an essential requirement.
We present DeepPool, a system that addresses this efficiency challenge through two key ideas.
arXiv Detail & Related papers (2021-12-19T05:18:39Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that the GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z) - Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z) - Communication Contention Aware Scheduling of Multiple Deep Learning
Training Jobs [17.45154289084637]
We establish a new DDL job scheduling framework which organizes DDL jobs as Directed Acyclic Graphs (DAGs)
We then propose an efficient algorithm, LWF-$\kappa$, to balance the GPU utilization and consolidate the allocated GPUs for each job.
We show that LWF-$\kappa$ achieves up to $1.59\times$ improvement over the classical first-fit algorithms.
arXiv Detail & Related papers (2020-02-24T07:50:56Z)
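Regarding the Q-Ensemble entry above, the batch-size/learning-rate adjustment it mentions is often implemented as the linear scaling rule; the minimal sketch below assumes that convention, which may differ from the paper's exact adjustment.

```python
# Linear learning-rate scaling rule (a common heuristic, assumed here for illustration):
# scale the learning rate proportionally to the increase in mini-batch size.
def scale_learning_rate(base_lr: float, base_batch: int, new_batch: int) -> float:
    return base_lr * (new_batch / base_batch)

# Example: growing the batch from 256 to 4096 scales a 3e-4 learning rate by 16x.
print(scale_learning_rate(3e-4, 256, 4096))  # 0.0048
```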
This list is automatically generated from the titles and abstracts of the papers in this site.