Communication Contention Aware Scheduling of Multiple Deep Learning
Training Jobs
- URL: http://arxiv.org/abs/2002.10105v1
- Date: Mon, 24 Feb 2020 07:50:56 GMT
- Title: Communication Contention Aware Scheduling of Multiple Deep Learning
Training Jobs
- Authors: Qiang Wang, Shaohuai Shi, Canhui Wang, Xiaowen Chu
- Abstract summary: We establish a new DDL job scheduling framework which organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes.
We then propose an efficient algorithm, LWF-$\kappa$, to balance the GPU utilization and consolidate the allocated GPUs for each job.
We show that LWF-$\kappa$ achieves up to $1.59\times$ improvement over the classical first-fit algorithms.
- Score: 17.45154289084637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed Deep Learning (DDL) has rapidly grown in popularity because it
helps boost training performance on high-performance GPU clusters.
Efficient job scheduling is indispensable to maximize the overall performance
of the cluster when training multiple jobs simultaneously. However, existing
schedulers do not consider the communication contention of multiple
communication tasks from different distributed training jobs, which could
deteriorate the system performance and prolong the job completion time. In this
paper, we first establish a new DDL job scheduling framework which organizes
DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication
contention between nodes. We then propose an efficient algorithm, LWF-$\kappa$,
to balance the GPU utilization and consolidate the allocated GPUs for each job.
When scheduling those communication tasks, we observe that neither avoiding all
the contention nor blindly accepting them is optimal to minimize the job
completion time. We thus propose a provable algorithm, AdaDUAL, to efficiently
schedule those communication tasks. Based on AdaDUAL, we finally propose
Ada-SRSF for the DDL job scheduling problem. Simulations on a 64-GPU cluster
connected with 10 Gbps Ethernet show that LWF-$\kappa$ achieves up to
$1.59\times$ improvement over the classical first-fit algorithms. More
importantly, Ada-SRSF reduces the average job completion time by $20.1\%$ and
$36.7\%$, as compared to the SRSF(1) scheme (avoiding all the contention) and
the SRSF(2) scheme (blindly accepting all two-way communication contention),
respectively.
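The abstract does not give AdaDUAL's exact decision rule, but its key observation (neither avoiding all contention nor blindly accepting it minimizes job completion time) can be illustrated with a small sketch. The following Python snippet is only an illustrative approximation under an assumed fair bandwidth-sharing model; the class, function, and field names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CommTask:
    """A communication task of one DDL job (fields are illustrative)."""
    job_id: int
    remaining_bytes: float  # data still to be transferred
    ready_time: float       # earliest time the task may start

def finish_if_contending(task: CommTask, running: List[CommTask],
                         link_bandwidth: float, now: float) -> float:
    """Estimated finish time if `task` starts now and shares the link.

    Assumes the link bandwidth is split evenly among concurrent transfers,
    a simplification of the contention model analyzed in the paper.
    """
    share = link_bandwidth / (len(running) + 1)
    return now + task.remaining_bytes / share

def finish_if_waiting(task: CommTask, running: List[CommTask],
                      link_bandwidth: float, now: float) -> float:
    """Estimated finish time if `task` waits until the link becomes idle."""
    # Under even sharing the link drains the running transfers at full rate,
    # so they are all finished once their total remaining bytes are sent.
    link_free_at = now + sum(t.remaining_bytes for t in running) / link_bandwidth
    return max(link_free_at, task.ready_time) + task.remaining_bytes / link_bandwidth

def schedule_comm_task(task: CommTask, running: List[CommTask],
                       link_bandwidth: float, now: float) -> Tuple[str, float]:
    """AdaDUAL-style decision (sketch): accept contention only when it is
    expected to complete the transfer earlier than waiting for an idle link."""
    t_contend = finish_if_contending(task, running, link_bandwidth, now)
    t_wait = finish_if_waiting(task, running, link_bandwidth, now)
    return ("contend", t_contend) if t_contend < t_wait else ("wait", t_wait)
```

A faithful implementation would also account for the slowdown that an accepted contention imposes on transfers already in flight; Ada-SRSF then applies such a rule while ordering jobs by a shortest-remaining-service-first policy.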
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- GPU Cluster Scheduling for Network-Sensitive Deep Learning [19.344426053952464]
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads.
Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling.
arXiv Detail & Related papers (2024-01-29T19:06:08Z)
- A Quadratic Synchronization Rule for Distributed Deep Learning [66.68264684667562]
This work proposes the Quadratic Synchronization Rule (QSR), a theory-grounded method for determining the synchronization period $H$ in local gradient methods.
Experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies.
arXiv Detail & Related papers (2023-10-22T21:38:57Z)
- FAMO: Fast Adaptive Multitask Optimization [48.59232177073481]
We introduce Fast Adaptive Multitask Optimization (FAMO), a dynamic weighting method that decreases task losses in a balanced way.
Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques.
arXiv Detail & Related papers (2023-06-06T15:39:54Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic gradient coding (GC) scheme, which assigns redundant data to workers to gain the flexibility to choose from among a set of possible codes depending on past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations while waiting for gradients to be aggregated.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)