A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning
- URL: http://arxiv.org/abs/2206.02034v2
- Date: Sat, 19 Aug 2023 13:42:29 GMT
- Title: A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning
- Authors: Kushal Chakrabarti and Nikhil Chopra
- Abstract summary: Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
- Score: 0.6526824510982802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient methods have become popular in optimizing deep neural
networks; recent examples include AdaGrad and Adam. Although Adam usually
converges faster, variations of Adam, for instance, the AdaBelief algorithm,
have been proposed to enhance Adam's poor generalization ability compared to
the classical stochastic gradient method. This paper develops a generic
framework for adaptive gradient methods that solve non-convex optimization
problems. We first model the adaptive gradient methods in a state-space
framework, which allows us to present simpler convergence proofs of adaptive
optimizers such as AdaGrad, Adam, and AdaBelief. We then utilize the transfer
function paradigm from classical control theory to propose a new variant of
Adam, coined AdamSSM. We add an appropriate pole-zero pair in the transfer
function from squared gradients to the second moment estimate. We prove the
convergence of the proposed AdamSSM algorithm. Applications on benchmark
machine learning tasks of image classification using CNN architectures and
language modeling using LSTM architecture demonstrate that the AdamSSM
algorithm improves the trade-off between generalization accuracy and convergence speed
compared to recent adaptive gradient methods.
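For concreteness, the second-moment recursion in standard Adam, v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2, is a first-order low-pass filter from the squared gradient to the second-moment estimate, with transfer function (1 - beta2) z / (z - beta2). The sketch below shows a standard Adam step in which that filter is augmented with one extra pole-zero section, in the spirit of the AdamSSM modification described in the abstract; the `pole` and `zero` values are placeholders chosen for illustration, since the abstract does not give the coefficients used by AdamSSM.

```python
import numpy as np

def adam_like_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   pole=0.9, zero=0.5, eps=1e-8):
    """One Adam-style step whose second-moment estimate is driven by a
    filter with an extra pole-zero pair.

    Standard Adam uses the first-order filter
        v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2,
    i.e. the transfer function (1 - beta2) z / (z - beta2) from g^2 to v.
    Here the squared gradient is additionally passed through
        s_t = pole * s_{t-1} + (g_t^2 - zero * g_{t-1}^2),
    an illustrative pole-zero section; `pole` and `zero` are placeholders,
    not the coefficients used by AdamSSM.
    """
    m, v, s, g2_prev, t = state
    t += 1
    g2 = grad * grad

    # First moment: standard exponential moving average of gradients.
    m = beta1 * m + (1.0 - beta1) * grad

    # Extra pole-zero section acting on the squared gradient.
    s = pole * s + (g2 - zero * g2_prev)

    # Second moment: EMA driven by the filtered squared gradient.
    v = beta2 * v + (1.0 - beta2) * s

    # Bias correction and parameter update, as in standard Adam.  The clip
    # keeps the square root well defined, since the filtered signal can dip
    # below zero for these illustrative coefficients.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(np.maximum(v_hat, 0.0)) + eps)

    return theta, (m, v, s, g2, t)

# Minimal usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.ones(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3), np.zeros(3), 0)
for _ in range(1000):
    theta, state = adam_like_step(theta, theta, state)
```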
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms such as Adam and its variants have been central to this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
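The MARS entry above combines an Adam-style preconditioner with a variance-reduced, scaled momentum estimate. The sketch below shows one generic way such a combination can look, using a STORM-style recursive correction; it is not the MARS update itself, and the scaling factor `gamma` and the `grad_fn(theta, batch)` interface are illustrative assumptions.

```python
import numpy as np

def vr_momentum_adam_step(theta, batch, grad_fn, state, lr=1e-3,
                          beta1=0.9, beta2=0.999, gamma=0.025, eps=1e-8):
    """One step combining a variance-reduced momentum estimate with an
    Adam-style preconditioner.

    The correction gamma * beta1 / (1 - beta1) * (g - g_prev) is a STORM-style
    recursive variance-reduction device; the exact scaling used by MARS is
    defined in that paper.  `grad_fn(theta, batch)` is an assumed interface
    returning the mini-batch gradient at `theta`.
    """
    m, v, theta_prev, t = state
    t += 1

    g = grad_fn(theta, batch)
    g_prev = grad_fn(theta_prev, batch)   # previous iterate, same mini-batch

    # Variance-reduced gradient estimate fed to the momentum buffer.
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)

    m = beta1 * m + (1.0 - beta1) * c          # first moment of corrected gradient
    v = beta2 * v + (1.0 - beta2) * c * c      # Adam-style second moment

    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    new_theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    return new_theta, (m, v, theta, t)

# Toy usage: least squares on a fixed batch, gradient of 0.5 * ||A x - b||^2.
A, b = np.random.randn(8, 3), np.random.randn(8)
grad_fn = lambda x, batch: A.T @ (A @ x - b)
state = (np.zeros(3), np.zeros(3), np.zeros(3), 0)
x = np.zeros(3)
for _ in range(500):
    x, state = vr_momentum_adam_step(x, None, grad_fn, state)
```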
- Improving the Adaptive Moment Estimation (ADAM) stochastic optimizer through an Implicit-Explicit (IMEX) time-stepping approach [1.2233362977312945]
The classical Adam algorithm is a first-order implicit-explicit (IMEX) discretization of the underlying ODE.
We propose new extensions of the Adam scheme obtained by using higher-order IMEX methods to solve the ODE.
We derive a new optimization algorithm for neural network training that performs better than classical Adam on several regression and classification problems.
arXiv Detail & Related papers (2024-03-20T16:08:27Z)
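As background for the entry above: a first-order implicit-explicit (IMEX) Euler scheme integrates an ODE whose right-hand side is split into a part handled explicitly and a part handled implicitly. The display below is only the generic first-order IMEX Euler update; the specific ODE and splitting that recover Adam are given in that paper.

```latex
% Generic first-order IMEX (Euler) step for  y' = f(y) + g(y),
% with f treated explicitly, g treated implicitly, and step size h:
y_{n+1} = y_n + h\, f(y_n) + h\, g(y_{n+1})
```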
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly with $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants built on the difference between the present and past gradients.
We test ensembles of networks and their fusion with ResNet50 trained with stochastic gradient descent.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- CADA: Communication-Adaptive Distributed Adam [31.02472517086767]
Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning.
This paper proposes an adaptive gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method.
arXiv Detail & Related papers (2020-12-31T06:52:18Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
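The MaxVA entry above replaces Adam's fixed-coefficient running mean of squared gradients with a weighted mean whose weights maximize the estimated gradient variance in each coordinate. The toy sketch below only illustrates that principle, picking each coordinate's decay factor from a small candidate set so as to maximize a running variance estimate; the actual MaxVA weighting rule is derived in the paper, and the candidate set `betas` is an assumption made for illustration.

```python
import numpy as np

def maxva_like_second_moment(m1, m2, grad, betas=(0.5, 0.9, 0.99)):
    """Illustration of a 'maximize observed variance' principle: for each
    coordinate, update running means of g and g^2 with every candidate decay
    factor and keep the candidate giving the largest variance estimate
    E[g^2] - E[g]^2.  This is only a toy stand-in for the MaxVA rule."""
    candidates = []
    for beta in betas:
        new_m1 = beta * m1 + (1.0 - beta) * grad
        new_m2 = beta * m2 + (1.0 - beta) * grad * grad
        var = new_m2 - new_m1 ** 2          # per-coordinate variance estimate
        candidates.append((var, new_m1, new_m2))

    var_stack = np.stack([c[0] for c in candidates])   # (n_betas, dim)
    best = np.argmax(var_stack, axis=0)                 # winning beta per coordinate
    m1 = np.choose(best, [c[1] for c in candidates])
    m2 = np.choose(best, [c[2] for c in candidates])
    return m1, m2          # m2 plays the role of Adam's second moment v_t

# Toy usage: feed a stream of gradients for a 4-dimensional parameter.
m1 = m2 = np.zeros(4)
for _ in range(100):
    m1, m2 = maxva_like_second_moment(m1, m2, np.random.randn(4))
```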
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
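For background on the last entry, which analyzes an optimistic adaptive method for min-max problems such as GAN training: the plain (non-adaptive) optimistic gradient descent-ascent step that such methods build on is the standard update below, not the specific algorithm analyzed in that paper.

```latex
% Optimistic gradient descent-ascent for  min_x max_y f(x, y),  step size eta:
x_{t+1} = x_t - 2\eta\, \nabla_x f(x_t, y_t) + \eta\, \nabla_x f(x_{t-1}, y_{t-1})
y_{t+1} = y_t + 2\eta\, \nabla_y f(x_t, y_t) - \eta\, \nabla_y f(x_{t-1}, y_{t-1})
```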
This list is automatically generated from the titles and abstracts of the papers on this site.