Tom: Leveraging trend of the observed gradients for faster convergence
- URL: http://arxiv.org/abs/2109.03820v1
- Date: Tue, 7 Sep 2021 20:19:40 GMT
- Title: Tom: Leveraging trend of the observed gradients for faster convergence
- Authors: Anirudh Maiya, Inumella Sricharan, Anshuman Pandey, Srinivas K. S
- Abstract summary: Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in both accuracy and convergence speed.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning can be attributed to various factors such as
the increase in computational power, large datasets, deep convolutional neural
networks, and optimizers. In particular, the choice of optimizer affects the
generalization, convergence rate, and training stability. Stochastic Gradient
Descent (SGD) is a first-order iterative optimizer that updates all parameters
with a uniform learning rate. This uniform update may not be suitable across
the entire training phase. A rudimentary solution is to employ a fine-tuned
learning rate scheduler that decreases the learning rate as a function of the
iteration. To eliminate the dependency on learning rate schedulers, adaptive
gradient optimizers such as AdaGrad, AdaDelta, RMSProp, and Adam employ a
parameter-wise scaling of the learning rate that is a function of the gradient
itself. We propose the Tom (Trend over Momentum) optimizer, a novel variant of
Adam that takes into account the trend observed for the gradients in the loss
landscape traversed by the neural network. In the
proposed Tom optimizer, an additional smoothing equation is introduced to
address the trend observed during the process of optimization. The smoothing
parameter introduced for the trend requires no tuning and can be used with
default values. Experimental results on image classification datasets such as
CIFAR-10, CIFAR-100 and CINIC-10 show that Tom outperforms Adagrad, Adadelta,
RMSProp and Adam in both accuracy and convergence speed. The source code is
publicly available at
https://github.com/AnirudhMaiya/Tom
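For intuition only, here is a rough NumPy sketch of the level-plus-trend idea described above: Adam's first and second moments are kept, and one additional Holt-style smoothing equation tracks the trend of the first moment. The trend parameter name (phi), its value, and the exact form of the update below are illustrative assumptions; the paper's equations and the default trend parameter are given in the paper and the linked repository.

```python
import numpy as np

def tom_like_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  phi=0.9, eps=1e-8):
    """One illustrative update: Adam-style moments plus a smoothed trend term.

    This is a sketch of the level-plus-trend idea only, not the paper's
    published update equations.
    """
    m, v, trend, prev_m, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad               # first moment (level), as in Adam
    trend = phi * trend + (1 - phi) * (m - prev_m)   # smoothed change of the first moment
    v = beta2 * v + (1 - beta2) * grad ** 2          # second moment, as in Adam
    m_hat = m / (1 - beta1 ** t)                     # Adam-style bias correction
    v_hat = v / (1 - beta2 ** t)
    # drive the step with level + trend instead of the level alone
    w = w - lr * (m_hat + trend) / (np.sqrt(v_hat) + eps)
    return w, (m, v, trend, m, t)

# toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([3.0, -2.0])
state = (np.zeros_like(w), np.zeros_like(w), np.zeros_like(w),
         np.zeros_like(w), 0)
for _ in range(2000):
    w, state = tom_like_step(w, w, state, lr=0.05)
print(w)  # should end up close to the origin
```

Relative to a plain Adam step, the only changes in this sketch are the extra smoothing state for the trend and the (m_hat + trend) numerator.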
Related papers
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement [29.675650285351768]
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks.
Approximate MU is a practical method for large-scale models.
We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension n and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Step-size Adaptation Using Exponentiated Gradient Updates [21.162404996362948]
We show that augmenting a given optimizer with an adaptive tuning method of the step-size greatly improves the performance.
We maintain a global step-size scale for the update as well as a gain factor for each coordinate.
We show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule.
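(A generic sketch of this global-scale-plus-per-coordinate-gain idea is given after the related-papers list.)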
arXiv Detail & Related papers (2022-01-31T23:17:08Z)
- Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models? [46.01087792062936]
We compare classes of estimators via the relative performance of the best method in the class.
This allows us to rigorously quantify the tuning sensitivity of learning algorithms.
arXiv Detail & Related papers (2021-08-26T16:01:37Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs strongly on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that use the difference between the present and the past gradients.
We have tested ensembles of networks and their fusion with ResNet50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
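As noted in the Step-size Adaptation entry above, here is a generic sketch of keeping a global step-size scale plus a per-coordinate gain factor and updating both multiplicatively. The concrete rule used below (driven by sign agreement between consecutive gradients) and all parameter names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def grad(w):
    # gradient of an ill-conditioned quadratic f(w) = 0.5 * w^T diag(1, 10) w
    return np.array([1.0, 10.0]) * w

w = np.array([5.0, 5.0])
base_lr = 0.01
scale = 1.0                      # global step-size scale
gains = np.ones_like(w)          # per-coordinate gain factors
meta_lr = 0.02                   # rate of the multiplicative adaptation (illustrative)
g_prev = np.zeros_like(w)

for _ in range(500):
    g = grad(w)
    # grow a gain when successive gradients agree in sign, shrink it when
    # they disagree (a sign of overshooting); the same signal, aggregated
    # over all coordinates, drives the global scale
    gains *= np.exp(meta_lr * np.sign(g * g_prev))
    scale *= np.exp(meta_lr * np.sign(np.dot(g, g_prev)))
    w -= base_lr * scale * gains * g
    g_prev = g

# the loss should end up many orders of magnitude below its starting value
print(w, 0.5 * np.dot(np.array([1.0, 10.0]) * w, w))
```

Multiplicative (exponentiated) updates keep the scale and the gains positive by construction, which is the main appeal of this style of step-size adaptation.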