Accelerating Distributed K-FAC with Smart Parallelism of Computing and
Communication Tasks
- URL: http://arxiv.org/abs/2107.06533v1
- Date: Wed, 14 Jul 2021 08:01:07 GMT
- Title: Accelerating Distributed K-FAC with Smart Parallelism of Computing and
Communication Tasks
- Authors: Shaohuai Shi, Lin Zhang, Bo Li
- Abstract summary: Kronecker-Factored Approximate Curvature (KFAC) is one of the most efficient approximation algorithms for training deep models.
Yet, training models with KFAC on GPU clusters incurs extensive computation and introduces extra communication during each iteration.
We propose SPD-KFAC, a distributed KFAC (D-KFAC) scheme with smart parallelism of computing and communication tasks, to reduce the iteration time.
- Score: 13.552262050816616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed training with synchronous stochastic gradient descent (SGD) on
GPU clusters has been widely used to accelerate the training of deep models.
However, SGD only utilizes the first-order gradient in model parameter updates,
so training can still take days or weeks. Recent studies have successfully
exploited approximate second-order information to speed up the training
process, among which Kronecker-Factored Approximate Curvature (KFAC) has emerged
as one of the most efficient approximation algorithms for training deep models.
Yet, training models with distributed KFAC (D-KFAC) on GPU clusters incurs
extensive computation and introduces extra communication during each iteration.
In this work, we propose SPD-KFAC, a D-KFAC scheme with smart parallelism of
computing and communication tasks, to reduce the iteration time. Specifically,
1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and
implement a pipelining mechanism for Kronecker factor computation and
communication with dynamic tensor fusion, and 3) we develop a load-balancing
placement for inverting multiple matrices on GPU clusters. We conduct real-world
experiments on a 64-GPU cluster with a 100 Gb/s InfiniBand interconnect.
Experimental results show that the proposed SPD-KFAC training scheme achieves a
10%-35% improvement over state-of-the-art algorithms.
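To make points 2) and 3) of the abstract more concrete, the sketch below illustrates, in plain Python, what the two mechanisms could look like: a dynamic tensor fusion step that packs per-layer Kronecker factors into fixed-size communication buckets so that the all-reduce of earlier buckets can overlap with computing later factors, and a greedy longest-processing-time placement that balances the matrix-inversion workload across GPUs. The abstract does not specify SPD-KFAC's actual fusion or placement policies, so the bucket threshold, the O(n^3) inversion cost model, and all names (fuse_factors, place_inversions, the layer labels) are illustrative assumptions rather than the authors' implementation.

```python
import heapq
from typing import Dict, List, Tuple


def fuse_factors(factor_sizes: List[Tuple[str, int]],
                 bucket_bytes: int = 4 * 1024 * 1024) -> List[List[str]]:
    """Dynamic tensor fusion (sketch): group per-layer Kronecker factors,
    in the order they become ready, into communication buckets that are
    flushed once they reach `bucket_bytes`, so an asynchronous all-reduce
    of a full bucket can overlap with computing the remaining factors."""
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in factor_sizes:
        current.append(name)
        current_bytes += nbytes
        if current_bytes >= bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        buckets.append(current)
    return buckets


def place_inversions(factor_dims: Dict[str, int],
                     num_gpus: int) -> Dict[int, List[str]]:
    """Load-balanced placement (sketch): assign each factor inversion to
    the currently least-loaded GPU, processing the largest matrices first
    and modeling inversion cost as O(n^3) in the factor dimension."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (accumulated cost, gpu id)
    heapq.heapify(heap)
    placement: Dict[int, List[str]] = {gpu: [] for gpu in range(num_gpus)}
    for name, dim in sorted(factor_dims.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(name)
        heapq.heappush(heap, (load + float(dim) ** 3, gpu))
    return placement


if __name__ == "__main__":
    # Hypothetical per-layer Kronecker factors (name, size in bytes).
    sizes = [("conv1.A", 1 << 20), ("conv1.G", 1 << 20),
             ("fc.A", 8 << 20), ("fc.G", 2 << 20)]
    print(fuse_factors(sizes, bucket_bytes=4 << 20))
    # Hypothetical factor dimensions to invert, spread over 2 GPUs.
    print(place_inversions({"conv1.A": 256, "conv1.G": 512,
                            "fc.A": 4096, "fc.G": 1024}, num_gpus=2))
```

In a real D-KFAC pipeline, each flushed bucket would typically be handed to a non-blocking collective (e.g. torch.distributed.all_reduce with async_op=True) while the remaining factors are still being computed, and each GPU would invert only the factors assigned to it, with the results then communicated back to the other workers.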
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Accelerating Large Language Model Training with Hybrid GPU-based Compression [3.204387803072905]
MPI libraries have been shown to significantly reduce message size and leverage interconnect bandwidth.
We investigate the efficacy of compression-assisted MPI collectives in the context of distributed Large Language Model (LLM) training.
arXiv Detail & Related papers (2024-09-04T04:05:30Z)
- DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
arXiv Detail & Related papers (2023-02-24T04:11:18Z)
- Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning [19.04755792575149]
We propose DP-KFAC, a novel distributed preconditioning scheme for deep neural network (DNN) training.
DP-KFAC reduces computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update.
arXiv Detail & Related papers (2022-06-30T09:22:25Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/backward propagation to wait for gradient synchronization.
DaSGD parallelizes SGD with forward/backward propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network as the predictive model.
Our method requires far fewer communication rounds in theory.
Experiments on several datasets confirm our theory and demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.