Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective
- URL: http://arxiv.org/abs/2106.00092v1
- Date: Mon, 31 May 2021 20:30:25 GMT
- Title: Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective
- Authors: Kushal Chakrabarti, Nikhil Chopra
- Abstract summary: We propose a new fast optimizer, Generalized AdaGrad (G-AdaGrad), for solving non-convex machine learning problems.
Specifically, we adopt a state-space perspective for analyzing the convergence of gradient acceleration algorithms, namely G-AdaGrad and Adam.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerated gradient-based methods are being extensively used for solving
non-convex machine learning problems, especially when the data points are
abundant or the available data is distributed across several agents. Two of the
prominent accelerated gradient algorithms are AdaGrad and Adam. AdaGrad is the
simplest accelerated gradient method, which is particularly effective for
sparse data. Adam has been shown to perform favorably in deep learning problems
compared to other methods. In this paper, we propose a new fast optimizer,
Generalized AdaGrad (G-AdaGrad), for accelerating the solution of potentially
non-convex machine learning problems. Specifically, we adopt a state-space
perspective for analyzing the convergence of gradient acceleration algorithms,
namely G-AdaGrad and Adam, in machine learning. Our proposed state-space models
are governed by ordinary differential equations. We present simple convergence
proofs of these two algorithms in the deterministic settings with minimal
assumptions. Our analysis also provides intuition behind improving upon
AdaGrad's convergence rate. We provide empirical results on MNIST dataset to
reinforce our claims on the convergence and performance of G-AdaGrad and Adam.
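The paper casts these optimizers as continuous-time state-space models; in discrete time, the updates they generalize are the familiar AdaGrad and Adam rules. The sketch below shows both, with a free exponent `c` on the accumulated second moment as one reading of the generalization (the exact parameterization in the paper may differ; `c = 0.5` recovers standard AdaGrad's square-root denominator):

```python
import numpy as np

def g_adagrad_step(x, grad, v, lr=0.1, c=0.5, eps=1e-8):
    """One AdaGrad-style update with a generalized exponent c.

    c = 0.5 gives standard AdaGrad; treating c as a free parameter
    is our reading of the G-AdaGrad generalization, not necessarily
    the authors' exact update rule.
    """
    v = v + grad**2                     # accumulate squared gradients
    x = x - lr * grad / (v**c + eps)    # per-coordinate adaptive step
    return x, v

def adam_step(x, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update (Kingma & Ba, 2015); t starts at 1."""
    m = b1 * m + (1 - b1) * grad        # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad**2     # second-moment estimate
    m_hat = m / (1 - b1**t)             # bias corrections for zero init
    v_hat = v / (1 - b2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

For example, iterating either step on the quadratic f(x) = x² (gradient 2x) drives x toward 0; the paper's ODE analysis studies exactly this kind of deterministic convergence.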
Related papers
- Towards Simple and Provable Parameter-Free Adaptive Gradient Methods [56.060918447252625]
We present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees.
We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.
arXiv Detail & Related papers (2024-12-27T04:22:02Z)
- Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad [16.249992982986956]
This paper introduces a novel scale-invariant adaptive algorithm named KATE, which performs consistently on complex machine learning tasks.
We compare KATE to other state-of-the-art adaptive algorithms, Adam and AdaGrad, in numerical experiments on different problems.
arXiv Detail & Related papers (2024-03-05T04:35:59Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach for handling streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- Fast Learning of Graph Neural Networks with Guaranteed Generalizability: One-hidden-layer Case [93.37576644429578]
Graph neural networks (GNNs) have made great progress recently on learning from graph-structured data in practice.
We provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems.
arXiv Detail & Related papers (2020-06-25T00:45:52Z)
- AdaX: Adaptive Gradient Descent with Exponential Long Term Memory [34.6432726391469]
We analyze a problem of Adam by examining its performance in simple non-convex machine learning tasks.
We propose a novel adaptive gradient method named AdaX to solve the problem.
AdaX outperforms Adam in various computer vision and natural language processing tasks.
arXiv Detail & Related papers (2020-04-21T03:46:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.