Asymmetric Momentum: A Rethinking of Gradient Descent
- URL: http://arxiv.org/abs/2309.02130v2
- Date: Tue, 3 Oct 2023 04:47:17 GMT
- Title: Asymmetric Momentum: A Rethinking of Gradient Descent
- Authors: Gongyue Zhang, Dinghuang Zhang, Shuwen Zhao, Donghan Liu, Carrie M.
Toptan and Honghai Liu
- Abstract summary: We propose the simplest SGD-based enhancement, Loss-Controlled Asymmetric Momentum (LCAM).
By averaging the loss, we divide the training process into different loss phases and apply a different momentum in each.
We experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset.
- Score: 4.1001738811512345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Through theoretical and experimental validation, and unlike existing adaptive
methods such as Adam, which penalize frequently-changing parameters and are only
applicable to sparse gradients, we propose the simplest SGD-based enhancement,
Loss-Controlled Asymmetric Momentum (LCAM). By averaging the loss, we divide the
training process into different loss phases and apply a different momentum in each.
The method not only accelerates slow-changing parameters for sparse gradients, similar
to adaptive optimizers, but can also accelerate frequently-changing
parameters for non-sparse gradients, making it adaptable to all types of
datasets. We reinterpret the machine learning training process through the
concepts of weight coupling and weight traction, and experimentally validate
that weights have directional specificity, which is correlated with the
specificity of the dataset. Interestingly, we observe that for non-sparse
gradients, frequently-changing parameters should actually be accelerated, which
is completely opposite to the traditional adaptive perspective. Compared to
traditional SGD with momentum, the algorithm separates the weights without
additional computational cost. Notably, the method relies on the
network's ability to extract complex features. We primarily use Wide Residual
Networks (WRN) for our research, employing the classic CIFAR-10 and CIFAR-100 datasets
to test the ability for feature separation, and observe phenomena that are
more important than accuracy rates alone. Finally, compared to classic SGD
tuning methods, using WRN on these two datasets we achieve equal or better test
accuracy with nearly half the training epochs.
Related papers
- Gradient-free variational learning with conditional mixture networks [39.827869318925494]
Conditional mixture networks (CMNs) are suitable for fast, gradient-free inference and can solve complex classification tasks.
We validate this approach by training two-layer CMNs on standard benchmarks from the UCI repository.
Our method, CAVI-CMN, achieves competitive and often superior predictive accuracy compared to maximum likelihood estimation (MLE) with backpropagation.
arXiv Detail & Related papers (2024-08-29T10:43:55Z) - Parallel and Limited Data Voice Conversion Using Stochastic Variational
Deep Kernel Learning [2.5782420501870296]
This paper proposes a voice conversion method that works with limited data.
It is based on stochastic variational deep kernel learning (SVDKL), which makes it possible to estimate non-smooth and more complex functions.
arXiv Detail & Related papers (2023-09-08T16:32:47Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study the training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
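The scaling rule above is easy to express as a helper; the proportionality constant `c` below is an assumption (the paper fixes it empirically), only the $\eta^{2/3}$ exponent comes from the summary:

```python
def scaled_momentum(lr, c=1.0):
    # Choose beta so that 1 - beta = c * lr**(2/3), the scaling
    # reported to maximally accelerate training.
    return 1.0 - c * lr ** (2.0 / 3.0)

# Smaller learning rates push beta closer to 1:
print(round(scaled_momentum(0.1), 3))   # → 0.785
print(round(scaled_momentum(0.01), 3))  # → 0.954
```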
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
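A forward gradient is the directional derivative along a random direction, scaled back onto that direction; the sketch below uses a finite difference as a stand-in for forward-mode AD and perturbs inputs rather than activations, so it illustrates the estimator, not the paper's variance-reduction scheme:

```python
import random

def forward_gradient(f, x, eps=1e-5):
    # Sample a random Gaussian direction v, approximate the directional
    # derivative f'(x; v), and return f'(x; v) * v. The estimate is
    # unbiased for the true gradient in expectation over v.
    v = [random.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    jvp = (f(x_plus) - f(x)) / eps
    return [jvp * vi for vi in v]
```

A single sample is noisy; averaging many samples (or, per the paper, perturbing activations instead of weights) reduces the variance.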
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants built on the difference between the present and the past gradients.
We test ensembles of networks and their fusion with a ResNet50 trained with stochastic gradient descent.
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - Preprint: Norm Loss: An efficient yet effective regularization method
for deep neural networks [7.214681039134488]
We propose a weight soft-regularization method based on the oblique manifold.
We evaluate our method on the popular CIFAR-10, CIFAR-100 and ImageNet 2012 datasets.
arXiv Detail & Related papers (2021-03-11T10:24:49Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
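The alignment idea behind AdaRem can be sketched with a scalar sign test; the multiplicative `up`/`down` factors below are illustrative assumptions, not AdaRem's actual update:

```python
def adjust_lr(base_lr, past_update, grad, up=1.05, down=0.95):
    # Grow the per-parameter learning rate when the accumulated past
    # parameter movement agrees with the current descent direction
    # (-grad), and shrink it otherwise (a sketch, not AdaRem's rule).
    aligned = past_update * (-grad) > 0
    return base_lr * (up if aligned else down)

# A parameter that kept moving in the descent direction gets a larger step:
print(adjust_lr(0.1, past_update=0.5, grad=-2.0))  # aligned → larger step
```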
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.