Balance is Essence: Accelerating Sparse Training via Adaptive Gradient
Correction
- URL: http://arxiv.org/abs/2301.03573v2
- Date: Tue, 5 Dec 2023 16:05:00 GMT
- Title: Balance is Essence: Accelerating Sparse Training via Adaptive Gradient
Correction
- Authors: Bowen Lei, Dongkuan Xu, Ruqi Zhang, Shuren He, Bani K. Mallick
- Abstract summary: Deep neural networks incur significant memory and computation costs.
Sparse training is one of the most common techniques to reduce these costs, but the added sparsity constraints make optimization slower and less stable.
In this work, we aim to overcome this problem and achieve space-time co-efficiency.
- Score: 29.61757744974324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite impressive performance, deep neural networks incur significant
memory and computation costs, prohibiting their application in
resource-constrained scenarios. Sparse training is one of the most common
techniques to reduce these costs; however, the sparsity constraints make the
optimization harder, increasing training time and instability. In this work,
we aim to overcome this problem and achieve
space-time co-efficiency. To accelerate and stabilize the convergence of sparse
training, we analyze the gradient changes and develop an adaptive gradient
correction method. Specifically, we approximate the correlation between the
current and previous gradients, which is used to balance the two gradients to
obtain a corrected gradient. Our method can be used with the most popular
sparse training pipelines under both standard and adversarial setups.
Theoretically, we prove that our method can accelerate the convergence rate of
sparse training. Extensive experiments on multiple datasets, model
architectures, and sparsities demonstrate that our method outperforms leading
sparse training methods by up to 5.0% in accuracy given the same
number of training epochs, and reduces the number of training epochs by up to
52.1% to achieve the same accuracy. Our code is available at:
https://github.com/StevenBoys/AGENT.
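To make the balancing idea concrete, here is a minimal sketch of the kind of correction the abstract describes, assuming the correlation between the current and previous gradients is estimated with a simple cosine similarity and used directly as the mixing weight; the paper's actual AGENT estimator and weighting rule may differ.
```python
import numpy as np

def corrected_gradient(grad, prev_grad, eps=1e-12):
    """Hypothetical sketch: blend the current and previous gradients
    using an estimated correlation as the balancing weight."""
    # Estimate the correlation between the current and previous gradients
    # (cosine similarity here; the paper defines its own approximation).
    corr = float(grad.ravel() @ prev_grad.ravel()) / (
        np.linalg.norm(grad) * np.linalg.norm(prev_grad) + eps)
    # Trust the previous gradient only to the extent it agrees with the
    # current one; a negative correlation disables the correction.
    weight = float(np.clip(corr, 0.0, 1.0))
    # Corrected gradient used in place of the raw sparse-training gradient.
    return (1.0 - weight) * grad + weight * prev_grad
```
The corrected gradient would then be handed to the underlying sparse-training optimizer in place of the raw gradient.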
Related papers
- MetaGrad: Adaptive Gradient Quantization with Hypernetworks [46.55625589293897]
Quantization-aware Training (QAT) accelerates the forward pass during neural network training and inference.
In this work, we propose to mitigate gradient quantization noise by incorporating the gradients into the computation graph of the next training iteration via a hypernetwork.
Various experiments on the CIFAR-10 dataset with different CNN architectures demonstrate that our hypernetwork-based approach can effectively reduce the negative effect of gradient quantization noise.
arXiv Detail & Related papers (2023-03-04T07:26:34Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
Deep equilibrium models are a class of models that forgo traditional network depth and instead compute the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
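For readers unfamiliar with deep equilibrium models, the forward pass can be pictured as a fixed-point computation over a single layer; the sketch below uses plain fixed-point iteration for clarity, whereas practical DEQ implementations typically rely on Broyden- or Anderson-type solvers.
```python
import numpy as np

def deq_forward(f, x, z0, max_iter=50, tol=1e-6):
    """Toy deep-equilibrium forward pass: iterate one nonlinear layer
    f(z, x) until z is (approximately) a fixed point z* = f(z*, x)."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Hypothetical usage with a single tanh layer (W, U are weight matrices):
# z_star = deq_forward(lambda z, x: np.tanh(W @ z + U @ x), x, np.zeros(d))
```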
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Efficient Neural Network Training via Forward and Backward Propagation
Sparsification [26.301103403328312]
We propose an efficient sparse training method with completely sparse forward and backward passes.
Our algorithm is far more effective at accelerating the training process, running up to an order of magnitude faster than previous methods.
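As a rough picture of what "completely sparse forward and backward passes" means, one can mask both the weights and the weight gradient with the same sparsity pattern; this generic masking sketch is an illustrative assumption, not the cited algorithm.
```python
import numpy as np

def sparse_forward(x, W, mask):
    # Only the masked-in entries of W participate in the forward pass.
    return x @ (W * mask)

def sparse_weight_grad(x, grad_out, mask):
    # Restrict the weight gradient to the same sparsity pattern so the
    # backward pass and the parameter update stay sparse as well.
    return (x.T @ grad_out) * mask
```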
arXiv Detail & Related papers (2021-11-10T13:49:47Z) - Adaptive Learning Rate and Momentum for Training Deep Neural Networks [0.0]
We develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework.
Experiments on image classification datasets show that our method yields faster convergence than other local solvers.
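The nonlinear CG framework mentioned above builds the update direction from the current gradient plus a data-dependent multiple of the previous direction; the Polak-Ribiere coefficient below is one standard choice, shown only as an illustration, since the paper derives its own learning-rate and momentum rules.
```python
import numpy as np

def cg_direction(grad, prev_grad, prev_dir, eps=1e-12):
    # Polak-Ribiere coefficient: acts like an adaptive momentum term.
    beta = max(0.0, float(grad @ (grad - prev_grad)) /
               (float(prev_grad @ prev_grad) + eps))
    # New search direction: steepest descent corrected by the old direction.
    return -grad + beta * prev_dir
```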
arXiv Detail & Related papers (2021-06-22T05:06:56Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
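Judging from the title, the key idea is to feed the momentumized gradient (rather than the raw gradient) into the adaptive stepsize; the Adam-style sketch below is an assumption about how that might look, not the paper's exact update.
```python
import numpy as np

def adamomentum_step(w, grad, m, v, t, lr=1e-3, beta1=0.9,
                     beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment built from the momentumized gradient m instead of the
    # raw gradient grad (the assumed difference from vanilla Adam).
    v = beta2 * v + (1 - beta2) * m ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```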
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - FracTrain: Fractionally Squeezing Bit Savings Both Temporally and
Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain that integrates progressive fractional quantization which gradually increases the precision of activations, weights, and gradients.
FracTrain reduces computational cost and hardware-quantified energy/latency of DNN training while achieving a comparable or better (-0.12% to +1.87%) accuracy.
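A minimal sketch of the "progressive" part, assuming a simple stage-based schedule (the bit-widths and switch points here are made up; FracTrain's actual schedule is determined differently):
```python
def precision_bits(epoch, total_epochs, bit_levels=(4, 6, 8)):
    # Start training at low precision and raise the bit-width of weights,
    # activations, and gradients as training proceeds.
    stage = min(int(len(bit_levels) * epoch / total_epochs),
                len(bit_levels) - 1)
    return bit_levels[stage]
```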
arXiv Detail & Related papers (2020-12-24T05:24:10Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
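The reason a linearized model helps is that its training dynamics can be simulated in closed form; the toy sketch below does this for full-batch gradient descent on a squared loss, which conveys the idea even though the cited predictor handles the fine-tuning setting in more generality.
```python
import numpy as np

def predict_steps_to_loss(kernel, residual0, lr, target_loss, max_steps=100000):
    """For a linearized model with squared loss, the training residual
    contracts along the eigen-directions of the (symmetric) kernel matrix,
    so the loss curve can be evaluated without running any training."""
    eigvals, eigvecs = np.linalg.eigh(kernel)
    coeffs = eigvecs.T @ residual0
    for step in range(max_steps + 1):
        residual = eigvecs @ (((1.0 - lr * eigvals) ** step) * coeffs)
        if 0.5 * np.sum(residual ** 2) <= target_loss:
            return step
    return None  # target loss not reached within max_steps
```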
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Step-Ahead Error Feedback for Distributed Training with Compressed
Gradient [99.42912552638168]
We show that a new "gradient mismatch" problem arises from the local error feedback in centralized distributed training.
We propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis.
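For context, the local error feedback that causes the gradient-mismatch problem typically looks like the generic loop below (the paper's step-ahead and error-averaging fixes are not shown here):
```python
import numpy as np

def top_k_compress(g, k):
    # Keep only the k largest-magnitude entries; a common gradient compressor.
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def error_feedback_step(grad, error, k):
    # Re-inject the error dropped by the compressor at previous steps,
    # compress, and carry the newly dropped part forward.
    corrected = grad + error
    compressed = top_k_compress(corrected, k)
    new_error = corrected - compressed
    return compressed, new_error
```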
arXiv Detail & Related papers (2020-08-13T11:21:07Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)