Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep
Learning in a Supercomputing Environment
- URL: http://arxiv.org/abs/2209.08497v1
- Date: Sun, 18 Sep 2022 07:42:31 GMT
- Title: Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep
Learning in a Supercomputing Environment
- Authors: Daegun Yoon and Sangyoon Oh
- Abstract summary: Gradient sparsification has been proposed to significantly reduce communication traffic.
Top-k gradient sparsification (Top-k SGD) is limited in how much it can speed up overall training performance.
We conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance.
- Score: 0.6091702876917281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To train deep learning models faster, distributed training on multiple GPUs
has become a very popular scheme in recent years. However, communication
bandwidth is still a major bottleneck of training performance. To improve
overall training performance, recent works have proposed gradient
sparsification methods that significantly reduce communication traffic.
Most of them, such as Top-k gradient sparsification (Top-k SGD), require
gradient sorting to select meaningful gradients. However, Top-k SGD is limited
in how much it can speed up overall training performance because gradient
sorting is significantly inefficient on GPUs. In this paper, we conduct
experiments that show the inefficiency of Top-k SGD and provide insight into
its low performance. Based on observations from our empirical analysis, we plan
to develop a high-performance gradient sparsification method as future work.
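To make the mechanism concrete, below is a minimal PyTorch-style sketch of Top-k gradient sparsification, assuming per-tensor selection with a fixed density ratio and local residual accumulation; these details, and all names, are illustrative assumptions rather than the authors' implementation. The `torch.topk` call is the selection step whose GPU inefficiency the paper analyzes.

```python
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Returns the selected values and their flat indices; the dropped entries
    are returned as a local residual that would typically be added back to
    the gradient at the next step (a common, assumed design choice).
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    # Selection over the full tensor: this is the step that becomes a
    # bottleneck on GPUs when tensors are large.
    _, indices = torch.topk(flat.abs(), k, sorted=False)
    selected = flat[indices]
    residual = flat.clone()
    residual[indices] = 0.0          # entries not communicated stay as local error
    return selected, indices, residual.view_as(grad)

# Hypothetical usage on one layer's gradient:
grad = torch.randn(1024, 1024)
vals, idx, residual = topk_sparsify(grad, ratio=0.01)
print(vals.shape, idx.shape)         # only ~1% of the 1,048,576 entries are communicated
```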
Related papers
- Gradient Sparsification For Masked Fine-Tuning of Transformers [6.936564049727831]
Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks.
Gradual unfreezing makes a trade-off between freezing the pretrained network and updating all of its parameters by gradually unfreezing the gradients of whole layers during training.
It is not clear whether gradually unfreezing layers throughout training is optimal, compared to sparse variants of gradual unfreezing.
arXiv Detail & Related papers (2023-07-19T16:13:13Z) - DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub-tasks and distributes them to workers (a simplified sketch of this partitioned selection appears after this list).
We show that DEFT achieves a significant improvement in training performance, in terms of gradient selection speed, over existing sparsifiers.
arXiv Detail & Related papers (2023-07-07T10:29:25Z) - Quantized Training of Gradient Boosting Decision Trees [84.97123593657584]
We propose to quantize all the high-precision gradients in a very simple yet effective way within the GBDT training algorithm.
With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits.
We observe up to 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on extensive datasets.
arXiv Detail & Related papers (2022-07-20T06:27:06Z) - Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is apparently the most crucial aspect of training a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experimental results show that our gradient correction framework can effectively improve gradient quality, reducing training epochs by $\sim$20% while also improving network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average of all workers' gradients (a minimal sketch of this averaging step appears after this list).
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
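As referenced in the DEFT entry above, here is a simplified sketch of partitioned gradient selection, assuming each worker is assigned a contiguous chunk of the flattened gradient and selects its own top entries. DEFT's actual partitioning, which exploits gradient-norm differences between layers, is not reproduced here; all names and parameters are illustrative assumptions.

```python
import numpy as np

def partitioned_topk(grad: np.ndarray, num_workers: int, ratio: float = 0.01):
    """Each worker selects top entries only within its own chunk of the
    gradient, so no single device selects over the full tensor."""
    flat = grad.ravel()
    chunks = np.array_split(np.arange(flat.size), num_workers)
    selected_indices = []
    for chunk_idx in chunks:
        chunk = flat[chunk_idx]
        k = max(1, int(chunk.size * ratio))
        # Local selection: argpartition avoids fully sorting the chunk.
        local_top = np.argpartition(np.abs(chunk), -k)[-k:]
        selected_indices.append(chunk_idx[local_top])
    return np.concatenate(selected_indices)

grad = np.random.default_rng(0).normal(size=(1024, 1024))
idx = partitioned_topk(grad, num_workers=4, ratio=0.01)
print(idx.size)   # roughly 1% of the entries, selected in four independent chunks
```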
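And as referenced in the Sparse Communication entry, the following single-process simulation illustrates the synchronous averaging step that gradient compression aims to cheapen. The toy loss, worker count, and learning rate are arbitrary assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 8, 0.1

# One shared parameter vector, replicated on every worker.
params = rng.normal(size=dim)

def local_gradient(worker_id: int, params: np.ndarray) -> np.ndarray:
    """Stand-in for a gradient computed on worker-local data (hypothetical)."""
    noise = rng.normal(scale=0.01, size=params.shape)
    return 2.0 * params + noise       # gradient of a toy quadratic loss

for step in range(3):
    # Each worker computes its local gradient on its own mini-batch.
    grads = [local_gradient(w, params) for w in range(num_workers)]
    # All-reduce (average): in a real setup this is the communication step
    # that gradient sparsification or quantization tries to shrink.
    avg_grad = np.mean(grads, axis=0)
    params -= lr * avg_grad           # identical update on every replica
    print(f"step {step}: ||params|| = {np.linalg.norm(params):.4f}")
```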