AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs
- URL: http://arxiv.org/abs/2210.06364v1
- Date: Wed, 12 Oct 2022 16:17:25 GMT
- Title: AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs
- Authors: Shiv Ram Dubey, Satish Kumar Singh, Bidyut Baran Chaudhuri
- Abstract summary: Stochastic gradient descent (SGD) optimizers are generally used to train convolutional neural networks (CNNs).
Existing SGD optimizers do not exploit the gradient norm of past iterations, which leads to poor convergence and performance.
We propose novel AdaNorm-based SGD optimizers that correct the gradient norm in each iteration based on the adaptive training history of gradient norms.
- Score: 23.523389372182613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The stochastic gradient descent (SGD) optimizers are generally used to train
the convolutional neural networks (CNNs). In recent years, several adaptive
momentum based SGD optimizers have been introduced, such as Adam, diffGrad,
Radam and AdaBelief. However, the existing SGD optimizers do not exploit the
gradient norm of past iterations, which leads to poor convergence and
performance. In this paper, we propose novel AdaNorm-based SGD optimizers that
correct the gradient norm in each iteration based on the adaptive training
history of gradient norms. By doing so, the proposed optimizers are able to
maintain a high and representative gradient throughout training and to address
the low and atypical gradient problems. The proposed concept is generic and can be used
with any existing SGD optimizer. We show the efficacy of the proposed AdaNorm
with four state-of-the-art optimizers, including Adam, diffGrad, Radam and
AdaBelief. We depict the performance improvement due to the proposed optimizers
using three CNN models, including VGG16, ResNet18 and ResNet50, on three
benchmark object recognition datasets, including CIFAR10, CIFAR100 and
TinyImageNet. Code: https://github.com/shivram1987/AdaNorm.
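The abstract describes the mechanism only at a high level: keep an adaptive history of past gradient norms and use it to correct gradients whose norm has become low or atypical. The PyTorch sketch below illustrates one plausible reading of that idea; the function name adanorm_correct, the EMA decay gamma, the rescale-only-when-low rule, and the toy Adam-style loop are all assumptions for illustration, not the authors' released implementation (see the repository linked above for that).

```python
import torch

def adanorm_correct(grad, norm_ema, gamma=0.95, eps=1e-8):
    """Hedged sketch of gradient-norm correction in the spirit of AdaNorm.

    Maintains an exponential moving average (EMA) of past gradient norms and,
    when the current norm falls below that history, rescales the gradient so
    its norm matches the EMA. The decay gamma, the condition, and the
    first-step behaviour are illustrative assumptions.
    """
    g_norm = grad.norm()
    norm_ema = gamma * norm_ema + (1.0 - gamma) * g_norm
    if g_norm < norm_ema:
        # Scale a low gradient up toward its historical norm level.
        grad = grad * (norm_ema / (g_norm + eps))
    return grad, norm_ema

# Toy Adam-style loop: the corrected gradient simply replaces the raw gradient
# in the moment updates, which is why the concept can be attached to Adam,
# diffGrad, Radam, AdaBelief, or any other SGD variant.
param = torch.randn(100, requires_grad=True)
m, v = torch.zeros_like(param), torch.zeros_like(param)
norm_ema = torch.tensor(0.0)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 11):
    loss = (param ** 2).sum()  # toy objective
    loss.backward()
    with torch.no_grad():
        g, norm_ema = adanorm_correct(param.grad, norm_ema)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        param -= lr * m_hat / (v_hat.sqrt() + eps)
        param.grad.zero_()
```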
Related papers
- Variational Stochastic Gradient Descent for Deep Neural Networks [16.96187187108041]
Current state-of-the-art optimizers are adaptive gradient-based methods such as Adam.
Here, we propose to combine both approaches, resulting in Variational Stochastic Gradient Descent (VSGD).
We show how our VSGD method relates to other adaptive gradient-based optimizers like Adam.
arXiv Detail & Related papers (2024-04-09T18:02:01Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-parameter free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate α by situational awareness.
It can be applied to problems of any dimension n and its cost scales only linearly with n.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance [2.71467552808655]
We exploit the layerwise statistics to suppress the range of the adaptive stepsizes of Adam.
The resulting algorithm is referred to as SET-Adam, where SET is a brief notation of the three operations.
SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.
arXiv Detail & Related papers (2023-02-02T11:46:23Z) - Moment Centralization based Gradient Descent Optimizers for
Convolutional Neural Networks [12.90962626557934]
Convolutional neural networks (CNNs) have shown very appealing performance for many computer vision applications.
In this paper, we propose moment centralization-based SGD optimizers for CNNs.
The proposed moment centralization is generic in nature and can be integrated with any of the existing adaptive momentum-based optimizers.
arXiv Detail & Related papers (2022-07-19T04:38:01Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that exploit the difference between the present and the past gradients.
We have tested ensembles of networks and their fusion with a ResNet50 trained with stochastic gradient descent.
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - Gradient Centralization: A New Optimization Technique for Deep Neural
Networks [74.935141515523]
Gradient centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient based DNNs with only one line of code.
arXiv Detail & Related papers (2020-04-03T10:25:00Z)
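The "one line of code" claim above refers to re-centering each weight gradient before the optimizer consumes it. Below is a minimal PyTorch sketch of that idea, assuming the mean is removed per output unit for multi-dimensional weights and that 1-D parameters are left untouched; the function name and these details are illustrative, not the authors' released code.

```python
import torch

def centralize_gradient(grad):
    """Sketch of gradient centralization (GC): give each output unit's
    gradient zero mean over its remaining dimensions. 1-D parameters
    (biases, normalization scales) are commonly left as-is."""
    if grad.dim() > 1:
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad

# Example: centralize a conv-layer gradient of shape (out_ch, in_ch, kH, kW).
g = torch.randn(64, 32, 3, 3)
g_centered = centralize_gradient(g)
print(g_centered.mean(dim=(1, 2, 3)).abs().max())  # ~0 for every output channel
```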