Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep
Learning in a Supercomputing Environment
- URL: http://arxiv.org/abs/2209.08497v1
- Date: Sun, 18 Sep 2022 07:42:31 GMT
- Title: Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep
Learning in a Supercomputing Environment
- Authors: Daegun Yoon and Sangyoon Oh
- Abstract summary: Gradient sparsification has been proposed to significantly reduce communication traffic.
Top-k gradient sparsification (Top-k SGD) is limited in how much it can speed up overall training performance.
We conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance.
- Score: 0.6091702876917281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To train deep learning models faster, distributed training on multiple GPUs
has become a very popular scheme in recent years. However, communication
bandwidth is still a major bottleneck of training performance. To improve
overall training performance, recent works have proposed gradient
sparsification methods that significantly reduce communication traffic.
Most of them, such as Top-k gradient sparsification (Top-k SGD), require
gradient sorting to select meaningful gradients. However, Top-k SGD is limited
in how much it can speed up overall training performance because gradient
sorting is significantly inefficient on GPUs. In this paper, we conduct
experiments that show the inefficiency of Top-k SGD and provide insight into
its low performance. Based on observations from our empirical analysis, we plan
to develop a high-performance gradient sparsification method as future work.
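To make the mechanism concrete, below is a minimal PyTorch-style sketch of Top-k gradient sparsification, assuming per-tensor selection with a fixed density ratio and local residual accumulation; these details, and all names, are illustrative assumptions rather than the authors' implementation. The `torch.topk` call is the selection step whose GPU inefficiency the paper analyzes.

```python
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Returns the selected values and their flat indices; the dropped entries
    are returned as a local residual that would typically be added back to
    the gradient at the next step (a common, assumed design choice).
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    # Selection over the full tensor: this is the step that becomes a
    # bottleneck on GPUs when tensors are large.
    _, indices = torch.topk(flat.abs(), k, sorted=False)
    selected = flat[indices]
    residual = flat.clone()
    residual[indices] = 0.0          # entries not communicated stay as local error
    return selected, indices, residual.view_as(grad)

# Hypothetical usage on one layer's gradient:
grad = torch.randn(1024, 1024)
vals, idx, residual = topk_sparsify(grad, ratio=0.01)
print(vals.shape, idx.shape)         # only ~1% of the 1,048,576 entries are communicated
```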
Related papers
- Gradient Sparsification For Masked Fine-Tuning of Transformers [6.936564049727831]
Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks.
Gradual unfreezing makes a trade-off between freezing the pretrained network and updating all of its parameters by gradually unfreezing the gradients of whole layers during training.
It is not clear whether gradually unfreezing layers throughout training is optimal, compared to sparse variants of gradual unfreezing.
arXiv Detail & Related papers (2023-07-19T16:13:13Z) - DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub-tasks and distributes them to workers (a simplified sketch of this partitioned selection appears after this list).
We show that DEFT achieves a significant improvement in training performance, in terms of gradient selection speed, over existing sparsifiers.
arXiv Detail & Related papers (2023-07-07T10:29:25Z) - Quantized Training of Gradient Boosting Decision Trees [84.97123593657584]
We propose to quantize all the high-precision gradients in a very simple yet effective way within the GBDT training algorithm.
With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits.
We observe up to 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on extensive datasets.
arXiv Detail & Related papers (2022-07-20T06:27:06Z) - Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is apparently the most crucial aspect of training a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experimental results show that our gradient correction framework can effectively improve gradient quality, reducing training epochs by $\sim$20% while also improving network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average of all workers' gradients (a minimal sketch of this averaging step appears after this list).
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
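As referenced in the DEFT entry above, here is a simplified sketch of partitioned gradient selection, assuming each worker is assigned a contiguous chunk of the flattened gradient and selects its own top entries. DEFT's actual partitioning, which exploits gradient-norm differences between layers, is not reproduced here; all names and parameters are illustrative assumptions.

```python
import numpy as np

def partitioned_topk(grad: np.ndarray, num_workers: int, ratio: float = 0.01):
    """Each worker selects top entries only within its own chunk of the
    gradient, so no single device selects over the full tensor."""
    flat = grad.ravel()
    chunks = np.array_split(np.arange(flat.size), num_workers)
    selected_indices = []
    for chunk_idx in chunks:
        chunk = flat[chunk_idx]
        k = max(1, int(chunk.size * ratio))
        # Local selection: argpartition avoids fully sorting the chunk.
        local_top = np.argpartition(np.abs(chunk), -k)[-k:]
        selected_indices.append(chunk_idx[local_top])
    return np.concatenate(selected_indices)

grad = np.random.default_rng(0).normal(size=(1024, 1024))
idx = partitioned_topk(grad, num_workers=4, ratio=0.01)
print(idx.size)   # roughly 1% of the entries, selected in four independent chunks
```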
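And as referenced in the Sparse Communication entry, the following single-process simulation illustrates the synchronous averaging step that gradient compression aims to cheapen. The toy loss, worker count, and learning rate are arbitrary assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 8, 0.1

# One shared parameter vector, replicated on every worker.
params = rng.normal(size=dim)

def local_gradient(worker_id: int, params: np.ndarray) -> np.ndarray:
    """Stand-in for a gradient computed on worker-local data (hypothetical)."""
    noise = rng.normal(scale=0.01, size=params.shape)
    return 2.0 * params + noise       # gradient of a toy quadratic loss

for step in range(3):
    # Each worker computes its local gradient on its own mini-batch.
    grads = [local_gradient(w, params) for w in range(num_workers)]
    # All-reduce (average): in a real setup this is the communication step
    # that gradient sparsification or quantization tries to shrink.
    avg_grad = np.mean(grads, axis=0)
    params -= lr * avg_grad           # identical update on every replica
    print(f"step {step}: ||params|| = {np.linalg.norm(params):.4f}")
```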