GPU Cluster Scheduling for Network-Sensitive Deep Learning
- URL: http://arxiv.org/abs/2401.16492v1
- Date: Mon, 29 Jan 2024 19:06:08 GMT
- Title: GPU Cluster Scheduling for Network-Sensitive Deep Learning
- Authors: Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T.
Kandemir, Chita R. Das
- Abstract summary: We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads.
Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling.
- Score: 19.344426053952464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads
that enables proximity-based consolidation of GPU resources based on the DDL
jobs' sensitivities to the anticipated communication-network delays. Our
scheduler consists of three major components: (i) a classical delay scheduling
algorithm to facilitate job placement and consolidation; (ii) a
network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism
to optimize delay timers for effective delay scheduling. Additionally, to
enable a cost-effective methodology for large-scale experiments, we develop a
data-driven DDL cluster simulation platform. Employing the simulation platform
we compare against several state-of-the-art alternatives on real-world workload
traces to demonstrate the benefits of our design. Our scheduler can provide an
improvement of up to 69% in end-to-end makespan for training all jobs compared
to the prevailing consolidation-based scheduling methods, while reducing the
average job completion time by up to 83% and cutting communication overheads by
up to 98% under congested networking conditions.
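To make the interplay of the components concrete, here is a minimal Python sketch of a delay-scheduling step. It is an illustration rather than the authors' implementation: the job fields (`net_sensitivity`, `delay_timer`), the 0.5 sensitivity threshold, and the 30-second scheduling tick are assumptions for exposition. A network-sensitive job is held back, up to its delay timer, until a consolidated (single-node) placement is available; once the timer expires, or for insensitive jobs, a scattered placement is accepted.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus_needed: int
    net_sensitivity: float  # 0 = insensitive, 1 = highly sensitive (assumed metric)
    delay_timer: float      # max seconds the job may wait for a consolidated slot
    waited: float = 0.0

def schedule_step(job, free_gpus, tick=30.0):
    """One delay-scheduling decision for a single job (illustrative only)."""
    # Prefer a single node that can host the whole job (consolidation).
    for node, free in free_gpus.items():
        if free >= job.gpus_needed:
            free_gpus[node] -= job.gpus_needed
            return f"{job.name}: consolidated on {node}"
    # Network-sensitive jobs keep waiting for consolidation until their delay timer expires.
    if job.net_sensitivity > 0.5 and job.waited < job.delay_timer:
        job.waited += tick
        return f"{job.name}: delayed ({job.waited:.0f}s of {job.delay_timer:.0f}s)"
    # Timer expired (or job is insensitive): accept a scattered placement.
    placement, remaining = [], job.gpus_needed
    for node, free in free_gpus.items():
        take = min(free, remaining)
        if take > 0:
            free_gpus[node] -= take
            placement.append(f"{node}x{take}")
            remaining -= take
    return f"{job.name}: scattered on {', '.join(placement)}"

cluster = {"node-a": 2, "node-b": 3, "node-c": 3}
jobs = [Job("job-sensitive", 4, net_sensitivity=0.9, delay_timer=120.0),
        Job("job-insensitive", 2, net_sensitivity=0.2, delay_timer=0.0)]
for j in jobs:
    print(schedule_step(j, cluster))
```

In the paper's design, the auto-tuner adjusts the delay timers at runtime rather than fixing them as constants, and preemption can reclaim consolidated slots for more sensitive jobs; both are omitted here for brevity.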
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
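The FusionLLM entry above highlights compression as the lever for slow geo-distributed links. The snippet below is a generic top-k gradient sparsification sketch in NumPy, not FusionLLM's adaptive scheme; the `k_ratio` parameter and function names are invented purely to illustrate how shipping only the largest-magnitude entries shrinks per-iteration traffic.

```python
import numpy as np

def topk_compress(grad: np.ndarray, k_ratio: float = 0.01):
    """Keep only the k largest-magnitude entries of a gradient tensor (generic sketch)."""
    flat = grad.ravel()
    k = max(1, int(flat.size * k_ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k magnitudes
    return idx, flat[idx], grad.shape

def topk_decompress(idx, values, shape):
    """Rebuild a dense tensor that is zero everywhere except the transmitted entries."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals, shape = topk_compress(grad)
print(f"transmitted {idx.nbytes + vals.nbytes} bytes instead of {grad.nbytes}")
```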
- Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
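For readers unfamiliar with split learning, the fragment below is a single-process PyTorch sketch of the vanilla split-learning idea that EPSL builds on, not EPSL's parallel design: the cut point, layer sizes, and data are arbitrary assumptions. The device runs the front of the model, the server runs the rest, and only the cut-layer activations (and their gradients) would cross the network.

```python
import torch
import torch.nn as nn

# Front of the model on the edge device, back of the model on the edge server.
device_part = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
server_part = nn.Sequential(nn.Linear(64, 10))

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
opt = torch.optim.SGD(list(device_part.parameters()) + list(server_part.parameters()), lr=0.1)

smashed = device_part(x)                  # "smashed data" sent device -> server
logits = server_part(smashed)             # server finishes the forward pass
loss = nn.functional.cross_entropy(logits, y)

opt.zero_grad()
loss.backward()                           # cut-layer gradient flows server -> device
opt.step()
print(f"loss after one split-learning step: {loss.item():.3f}")
```

In a real deployment the two parts run in separate processes and the activations and gradients are serialized over the network; autograd handles that hand-off here only because everything lives in one process.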
- Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning [11.007816552466952]
This paper focuses on the problem of scheduling inference queries on Deep Neural Networks in edge networks at short timescales.
By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP.
We design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions.
arXiv Detail & Related papers (2023-01-31T13:23:34Z)
- Time-sensitive Learning for Heterogeneous Federated Edge Intelligence [52.83633954857744]
We investigate real-time machine learning in a federated edge intelligence (FEI) system.
FEI systems exhibit heterogeneous communication and computational resource distribution.
We propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model.
arXiv Detail & Related papers (2023-01-26T08:13:22Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimise the real-time parallel processing by using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
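The MESS summary above hinges on early exits; the sketch below shows the generic inference-time mechanism in PyTorch, using small classification heads for brevity where MESS attaches segmentation heads, and an assumed 0.9 confidence threshold. It illustrates early exiting only, not the MESS training or head-placement procedure.

```python
import torch
import torch.nn as nn

# Backbone blocks with one exit head per block (sizes are arbitrary).
blocks = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3 if i == 0 else 16, 16, 3, padding=1), nn.ReLU())
    for i in range(3)
])
exits = nn.ModuleList([
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
    for _ in range(3)
])

@torch.no_grad()
def predict(x, threshold=0.9):
    """Stop at the first exit whose softmax confidence clears the threshold."""
    for i, (block, head) in enumerate(zip(blocks, exits)):
        x = block(x)
        probs = torch.softmax(head(x), dim=1)
        confidence, label = probs.max(dim=1)
        if confidence.item() >= threshold or i == len(blocks) - 1:
            return label.item(), i  # predicted class and the exit index taken

label, exit_used = predict(torch.randn(1, 3, 32, 32))
print(f"predicted class {label} via exit {exit_used}")
```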
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
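As a toy illustration of the "improper" idea in the entry above (learning a mixture over given atomic policies with a policy gradient), the following sketch runs REINFORCE on a two-queue toy system; the atomic policies, reward, and learning rates are all invented for exposition and are not the paper's algorithm or queueing model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two assumed atomic schedulers for a toy two-queue system.
def longest_queue_first(q): return int(np.argmax(q))
def random_queue(q): return int(rng.integers(len(q)))
atomic = [longest_queue_first, random_queue]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(len(atomic))   # logits of the mixture over atomic policies
baseline, lr = 0.0, 0.05
for _ in range(2000):
    q = rng.integers(0, 10, size=2).astype(float)
    probs = softmax(theta)
    k = rng.choice(len(atomic), p=probs)      # sample an atomic policy to follow
    served = atomic[k](q)
    q[served] = max(0.0, q[served] - 5.0)     # serve up to 5 packets from that queue
    reward = -q.sum()                         # smaller backlog = higher reward
    baseline += 0.1 * (reward - baseline)     # running baseline for variance reduction
    grad = -probs
    grad[k] += 1.0                            # REINFORCE: gradient of log pi(k)
    theta += lr * (reward - baseline) * grad
print("learned mixture over atomic policies:", softmax(theta))
```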
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs [17.45154289084637]
We establish a new DDL job scheduling framework which organizes DDL jobs as Directed Acyclic Graphs (DAGs).
We then propose an efficient algorithm, LWF-$\kappa$, to balance the GPU utilization and consolidate the allocated GPUs for each job.
We show that LWF-$\kappa$ achieves up to $1.59\times$ improvement over the classical first-fit algorithms.
arXiv Detail & Related papers (2020-02-24T07:50:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.