Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters
- URL: http://arxiv.org/abs/2108.03645v1
- Date: Sun, 8 Aug 2021 14:20:05 GMT
- Title: Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters
- Authors: Zhengda Bian and Shenggui Li and Wei Wang and Yang You
- Abstract summary: We propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
- Score: 10.395955671683245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient GPU resource scheduling is essential to maximize resource
utilization and save training costs for the increasing amount of deep learning
workloads in shared GPU clusters. Existing GPU schedulers largely rely on
static policies to leverage the performance characteristics of deep learning
jobs. However, they can hardly reach optimal efficiency due to the lack of
elasticity. To address the problem, we propose ONES, an ONline Evolutionary
Scheduler for elastic batch size orchestration. ONES automatically manages the
elasticity of each job based on the training batch size, so as to maximize GPU
utilization and improve scheduling efficiency. It determines the batch size for
each job through an online evolutionary search that can continuously optimize
the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on
TACC's Longhorn supercomputers. The results show that ONES can outperform the
prior deep learning schedulers with a significantly shorter average job
completion time.
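The abstract describes determining each job's batch size through an online evolutionary search that continuously optimizes scheduling decisions. As a minimal sketch of that idea (not the authors' implementation), assuming a hypothetical rule of one GPU per 32 samples of batch size, an evolutionary loop over per-job batch sizes might look like:

```python
import random

def evolve_batch_sizes(jobs, total_gpus, generations=200, seed=0):
    """Toy evolutionary search over per-job batch sizes.

    `jobs` maps a job name to its candidate batch sizes; each job is
    assumed (for illustration only) to need batch_size // 32 GPUs.
    Returns the assignment found with the highest GPU utilization
    that does not exceed `total_gpus`.
    """
    rng = random.Random(seed)
    names = list(jobs)

    def gpus_needed(assignment):
        return sum(assignment[n] // 32 for n in names)

    def fitness(assignment):
        used = gpus_needed(assignment)
        return used if used <= total_gpus else -1  # infeasible

    # start every job at its smallest batch size, then mutate
    best = {n: min(jobs[n]) for n in names}
    for _ in range(generations):
        child = dict(best)
        n = rng.choice(names)            # mutate one job's batch size
        child[n] = rng.choice(jobs[n])
        if fitness(child) > fitness(best):
            best = child
    return best
```

The real scheduler re-runs its search online as jobs arrive and finish; this sketch only shows one static round of the evolutionary idea.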
Related papers
- ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning [54.08906841213777]
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning.
We propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of computation times.
We show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times.
arXiv Detail & Related papers (2025-02-02T12:22:26Z)
- CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning [0.0]
In this work, we employ an automatic approach to optimize GPU SASS schedules.
The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling.
Experiments show that CuAsmRL can further improve the performance of existing kernels by up to 26%, and by 9% on average.
arXiv Detail & Related papers (2025-01-14T12:36:18Z)
- Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm.
By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPU.
A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
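The single-machine preemptive shortest-remaining-processing-time-first strategy that A-SRPT reduces to can be sketched as follows (a toy illustration, not the paper's code; jobs here are hypothetical (arrival, processing-time) pairs):

```python
import heapq

def srpt_schedule(jobs):
    """Preemptive SRPT on a single machine.

    `jobs` is a list of (arrival_time, processing_time) pairs.
    Whenever a job arrives or finishes, the job with the least
    remaining work runs; this minimizes mean completion time on
    one machine. Returns the average completion time.
    """
    events = sorted(jobs)                 # by arrival time
    heap = []                             # (remaining_work, job_index)
    t = 0.0
    completions = []
    i = 0
    while i < len(events) or heap:
        if not heap:                      # idle until the next arrival
            t = max(t, events[i][0])
        # admit every job that has arrived by time t
        while i < len(events) and events[i][0] <= t:
            heapq.heappush(heap, (events[i][1], i))
            i += 1
        remaining, idx = heapq.heappop(heap)
        # run until the job finishes or the next arrival preempts it
        horizon = events[i][0] if i < len(events) else float("inf")
        run = min(remaining, horizon - t)
        t += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, idx))
        else:
            completions.append(t)
    return sum(completions) / len(completions)
```

For example, with jobs [(0, 3), (1, 1)], the short job preempts the long one at time 1, giving completions at times 2 and 4 (average 3.0), versus 3 and 4 under first-come-first-served.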
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability introduced by peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks [51.71682428015139]
We propose HARL, a reinforcement learning-based auto-scheduler for efficient tensor program exploration.
HARL improves tensor operator performance by 22% and search speed by 4.3x compared to the state-of-the-art auto-scheduler.
Inference performance and search speed are also significantly improved on end-to-end neural networks.
arXiv Detail & Related papers (2022-11-21T04:15:27Z)
- Efficient Strong Scaling Through Burst Parallel Training [13.656104138147967]
Using large GPU clusters to train deep neural network (DNN) models is becoming an essential requirement.
We present DeepPool, a system that addresses this efficiency challenge through two key ideas.
arXiv Detail & Related papers (2021-12-19T05:18:39Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be consistently improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z)
- Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.