Aryl: An Elastic Cluster Scheduler for Deep Learning
- URL: http://arxiv.org/abs/2202.07896v1
- Date: Wed, 16 Feb 2022 07:03:25 GMT
- Title: Aryl: An Elastic Cluster Scheduler for Deep Learning
- Authors: Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, Cong Wang
- Abstract summary: We introduce Aryl, a new cluster scheduler that addresses problems on both the training and inference sides.
Aryl introduces capacity loaning, which lends idle inference GPU servers to training jobs.
It improves cluster usage by up to 26.9% over a cluster scheduler without capacity loaning or elastic scaling.
- Score: 12.942546041713596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Companies build separate training and inference GPU clusters for deep
learning, and use separate schedulers to manage them. This leads to problems
for both training and inference: inference clusters have low GPU utilization
when the traffic load is low; training jobs often experience long queueing time
due to lack of resources. We introduce Aryl, a new cluster scheduler to address
these problems. Aryl introduces capacity loaning, which lends idle inference GPU
servers to training jobs. It further exploits elastic scaling, which scales a
training job's GPU allocation to better utilize loaned resources. Capacity
loaning and elastic scaling create new challenges to cluster management. When
the loaned servers need to be returned, we need to minimize the number of job
preemptions; when more GPUs become available, we need to allocate them to
elastic jobs and minimize the job completion time (JCT). Aryl addresses these
combinatorial problems using principled heuristics. It introduces the notion of
server preemption cost which it greedily reduces during server reclaiming. It
further relies on the JCT reduction value defined for each additional worker
for an elastic job to solve the scheduling problem as a multiple-choice
knapsack problem. Prototype implementation on a 64-GPU testbed and large-scale
simulation with 15-day traces of over 50,000 production jobs show that Aryl
brings 1.53x and 1.50x reductions in average queueing time and JCT, and improves
cluster usage by up to 26.9% over the cluster scheduler without capacity
loaning or elastic scaling.
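To make the abstract's server-reclaiming heuristic concrete, below is a minimal sketch in Python. It assumes a deliberately simple cost model in which a server's preemption cost is just the number of training jobs it would preempt; Aryl's actual cost definition and scheduler machinery are more involved, and every name here (Server, preemption_cost, reclaim_servers) is illustrative rather than the paper's API.

```python
# Hypothetical sketch of greedy server reclaiming by preemption cost.
# All names here are illustrative, not Aryl's actual API.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Server:
    server_id: str
    # Training jobs currently running on this loaned inference server;
    # returning the server preempts them.
    running_jobs: Set[str] = field(default_factory=set)

def preemption_cost(server: Server) -> int:
    # Simplified cost model: the number of training jobs that would be
    # preempted if this server were reclaimed. The paper's cost notion
    # may weigh jobs differently (e.g., by progress or GPU count).
    return len(server.running_jobs)

def reclaim_servers(loaned: List[Server], needed: int) -> List[Server]:
    # Greedily return the `needed` loaned servers whose reclamation
    # preempts the fewest jobs, keeping total preemption cost low.
    return sorted(loaned, key=preemption_cost)[:needed]

# Example: reclaim 2 of 3 loaned servers; the empty and lightly loaded
# servers are returned first, so only job "c" risks preemption.
loaned = [
    Server("s1", {"a", "b"}),
    Server("s2", set()),
    Server("s3", {"c"}),
]
print([s.server_id for s in reclaim_servers(loaned, 2)])  # ['s2', 's3']
```

Sorting loaned servers by preemption cost and returning the cheapest first captures the greedy idea: when the inference cluster asks for its capacity back, the servers whose return disturbs the fewest training jobs are reclaimed first.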
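The elastic-scaling decision can similarly be read as a multiple-choice knapsack problem: each elastic job is a class, each possible number of additional workers is an item whose value is that job's JCT reduction, and the knapsack capacity is the number of free GPUs. The dynamic-programming sketch below is a generic MCKP solver under that reading, not the authors' implementation; the jct_reduction values are taken as given.

```python
# Generic multiple-choice knapsack (MCKP) sketch for handing newly freed
# GPUs to elastic jobs. jct_reduction[j][k] is the assumed JCT-reduction
# value of giving job j exactly k extra workers (index 0 must be 0.0);
# how Aryl derives these values is not reproduced here.
from typing import List

def allocate_gpus(jct_reduction: List[List[float]], budget: int) -> List[int]:
    NEG = float("-inf")
    # best[g]: max total value when exactly g GPUs are handed out
    # across the jobs processed so far.
    best = [0.0] + [NEG] * budget
    choice = []  # choice[j][g]: workers given to job j when g GPUs remain
    for values in jct_reduction:
        nxt, pick = [NEG] * (budget + 1), [0] * (budget + 1)
        for g in range(budget + 1):
            for k, v in enumerate(values[: g + 1]):
                if best[g - k] != NEG and best[g - k] + v > nxt[g]:
                    nxt[g], pick[g] = best[g - k] + v, k
        best = nxt
        choice.append(pick)
    # Trace back the per-job allocation from the best total.
    g = max(range(budget + 1), key=best.__getitem__)
    alloc = [0] * len(jct_reduction)
    for j in range(len(jct_reduction) - 1, -1, -1):
        alloc[j] = choice[j][g]
        g -= alloc[j]
    return alloc

# Example: two elastic jobs, three free GPUs. Job 0's total JCT reduction
# is 5.0 or 7.0 for 1 or 2 extra workers; job 1's is 4.0 or 9.0. The
# optimum gives one GPU to job 0 and two to job 1.
print(allocate_gpus([[0.0, 5.0, 7.0], [0.0, 4.0, 9.0]], budget=3))  # [1, 2]
```

Because the zero-extra-workers option (value 0.0) is always available, every job picks exactly one allocation level, which is what makes this a multiple-choice rather than a plain 0/1 knapsack.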
Related papers
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast, untapped potential of consumer-level GPUs.
Such a system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs [38.92672037891692]
AnalySIM is a cluster simulator that allows efficient design explorations for multi-tenant machine learning services.
It can easily test and analyze various scheduling policies in a number of performance metrics such as GPU resource utilization.
We find that preemption and migration are able to significantly reduce average job completion time.
arXiv Detail & Related papers (2022-01-10T06:00:11Z)
- Efficient Strong Scaling Through Burst Parallel Training [13.656104138147967]
Using large GPU clusters to train deep neural network (DNN) models is becoming an essential requirement.
We present DeepPool, a system that addresses this efficiency challenge through two key ideas.
arXiv Detail & Related papers (2021-12-19T05:18:39Z)
- Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocations can improve average JCT up to 3.4x when compared to traditional GPU-proportional scheduling.
arXiv Detail & Related papers (2021-10-12T15:25:54Z)
- Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters [10.395955671683245]
We propose ONES, an ONline Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES outperforms prior deep learning schedulers with a significantly shorter average job completion time.
arXiv Detail & Related papers (2021-08-08T14:20:05Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of straggling behaviour, and under-utilization due to treating workers in a binary straggler/non-straggler fashion.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
- Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs [17.45154289084637]
We establish a new DDL job scheduling framework that organizes DDL jobs as Directed Acyclic Graphs (DAGs).
We then propose an efficient algorithm, LWF-$\kappa$, to balance GPU utilization and consolidate the GPUs allocated to each job.
We show that LWF-$\kappa$ achieves up to $1.59\times$ improvement over classical first-fit algorithms.
arXiv Detail & Related papers (2020-02-24T07:50:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.