Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters
- URL: http://arxiv.org/abs/2108.03645v1
- Date: Sun, 8 Aug 2021 14:20:05 GMT
- Title: Online Evolutionary Batch Size Orchestration for Scheduling Deep
Learning Workloads in GPU Clusters
- Authors: Zhengda Bian and Shenggui Li and Wei Wang and Yang You
- Abstract summary: We propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
- Score: 10.395955671683245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient GPU resource scheduling is essential to maximize resource
utilization and save training costs for the increasing amount of deep learning
workloads in shared GPU clusters. Existing GPU schedulers largely rely on
static policies to leverage the performance characteristics of deep learning
jobs. However, they can hardly reach optimal efficiency due to the lack of
elasticity. To address the problem, we propose ONES, an ONline Evolutionary
Scheduler for elastic batch size orchestration. ONES automatically manages the
elasticity of each job based on the training batch size, so as to maximize GPU
utilization and improve scheduling efficiency. It determines the batch size for
each job through an online evolutionary search that can continuously optimize
the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on
TACC's Longhorn supercomputers. The results show that ONES can outperform the
prior deep learning schedulers with a significantly shorter average job
completion time.
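The abstract describes determining each job's batch size through an online evolutionary search that continuously optimizes scheduling decisions. As a minimal sketch of that idea (not the authors' implementation), assuming a hypothetical rule of one GPU per 32 samples of batch size, an evolutionary loop over per-job batch sizes might look like:

```python
import random

def evolve_batch_sizes(jobs, total_gpus, generations=200, seed=0):
    """Toy evolutionary search over per-job batch sizes.

    `jobs` maps a job name to its candidate batch sizes; each job is
    assumed (for illustration only) to need batch_size // 32 GPUs.
    Returns the assignment found with the highest GPU utilization
    that does not exceed `total_gpus`.
    """
    rng = random.Random(seed)
    names = list(jobs)

    def gpus_needed(assignment):
        return sum(assignment[n] // 32 for n in names)

    def fitness(assignment):
        used = gpus_needed(assignment)
        return used if used <= total_gpus else -1  # infeasible

    # start every job at its smallest batch size, then mutate
    best = {n: min(jobs[n]) for n in names}
    for _ in range(generations):
        child = dict(best)
        n = rng.choice(names)            # mutate one job's batch size
        child[n] = rng.choice(jobs[n])
        if fitness(child) > fitness(best):
            best = child
    return best
```

The real scheduler re-runs its search online as jobs arrive and finish; this sketch only shows one static round of the evolutionary idea.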
Related papers
- ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning [54.08906841213777]
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning.
We propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of computation times.
We show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times.
arXiv Detail & Related papers (2025-02-02T12:22:26Z)
- CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning [0.0]
In this work, we employ an automatic approach to optimize GPU SASS schedules.
The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling.
Experiments show that CuAsmRL can further improve the performance of existing kernels by up to 26%, and by 9% on average.
arXiv Detail & Related papers (2025-01-14T12:36:18Z)
- Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm.
By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPU.
A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
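The single-machine preemptive shortest-remaining-processing-time-first strategy that A-SRPT reduces to can be sketched as follows (a toy illustration, not the paper's code; jobs here are hypothetical (arrival, processing-time) pairs):

```python
import heapq

def srpt_schedule(jobs):
    """Preemptive SRPT on a single machine.

    `jobs` is a list of (arrival_time, processing_time) pairs.
    Whenever a job arrives or finishes, the job with the least
    remaining work runs; this minimizes mean completion time on
    one machine. Returns the average completion time.
    """
    events = sorted(jobs)                 # by arrival time
    heap = []                             # (remaining_work, job_index)
    t = 0.0
    completions = []
    i = 0
    while i < len(events) or heap:
        if not heap:                      # idle until the next arrival
            t = max(t, events[i][0])
        # admit every job that has arrived by time t
        while i < len(events) and events[i][0] <= t:
            heapq.heappush(heap, (events[i][1], i))
            i += 1
        remaining, idx = heapq.heappop(heap)
        # run until the job finishes or the next arrival preempts it
        horizon = events[i][0] if i < len(events) else float("inf")
        run = min(remaining, horizon - t)
        t += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, idx))
        else:
            completions.append(t)
    return sum(completions) / len(completions)
```

For example, with jobs [(0, 3), (1, 1)], the short job preempts the long one at time 1, giving completions at times 2 and 4 (average 3.0), versus 3 and 4 under first-come-first-served.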
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability introduced by peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks [51.71682428015139]
We propose HARL, a reinforcement learning-based auto-scheduler for efficient tensor program exploration.
HARL improves tensor operator performance by 22% and search speed by 4.3x compared to the state-of-the-art auto-scheduler.
Inference performance and search speed are also significantly improved on end-to-end neural networks.
arXiv Detail & Related papers (2022-11-21T04:15:27Z)
- Efficient Strong Scaling Through Burst Parallel Training [13.656104138147967]
Using large GPU clusters to train deep neural network (DNN) models is becoming an essential requirement.
We present DeepPool, a system that addresses this efficiency challenge through two key ideas.
arXiv Detail & Related papers (2021-12-19T05:18:39Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be consistently improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z)
- Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.