AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
- URL: http://arxiv.org/abs/2004.09740v2
- Date: Mon, 4 May 2020 21:05:58 GMT
- Title: AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
- Authors: Wenjie Li, Zhaoyang Zhang, Xinjiang Wang, Ping Luo
- Abstract summary: We identify a problem of Adam by analyzing its performance on a simple non-convex synthetic problem.
We propose a novel adaptive gradient descent algorithm named AdaX to solve this problem.
AdaX outperforms Adam in various computer vision and natural language processing tasks.
- Score: 34.6432726391469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although adaptive optimization algorithms such as Adam show fast convergence
in many machine learning tasks, this paper identifies a problem of Adam by
analyzing its performance in a simple non-convex synthetic problem, showing
that Adam's fast convergence can lead the algorithm to local minima. To address
this problem, we improve Adam by proposing a novel adaptive gradient descent
algorithm named AdaX. Unlike Adam, which ignores past gradients, AdaX
exponentially accumulates long-term gradient information from the past during
training to adaptively tune the learning rate.
We thoroughly prove the convergence of AdaX in both the convex and non-convex
settings. Extensive experiments show that AdaX outperforms Adam in various
tasks of computer vision and natural language processing and can catch up with
Stochastic Gradient Descent.
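As a concrete illustration of the "exponential long term memory" idea, the following is a minimal NumPy sketch of an AdaX-style update. It assumes the second-moment recursion v_t = (1 + beta2) * v_{t-1} + beta2 * g_t^2 with bias correction (1 + beta2)^t - 1; the function name, hyperparameter defaults, and the toy quadratic in the usage loop are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def adax_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    """One AdaX-style step (sketch).

    Unlike Adam, whose second-moment estimate decays and therefore forgets
    old gradients, the accumulator v here grows by a factor (1 + beta2) each
    step, so long-term gradient information is retained.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # first moment, as in Adam
    v = (1.0 + beta2) * v + beta2 * grad ** 2   # exponentially accumulated second moment
    v_hat = v / ((1.0 + beta2) ** t - 1.0)      # bias correction (t starts at 1)
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize the quadratic ||theta - 0.5||^2.
theta = np.zeros(10)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 2.0 * (theta - 0.5)
    theta, m, v = adax_update(theta, grad, m, v, t)
```

For contrast, Adam's corresponding recursion is v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 with beta2 < 1, which down-weights past gradients exponentially; the sketch above instead keeps amplifying them, which is the long-term memory the title refers to.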
Related papers
- Towards Simple and Provable Parameter-Free Adaptive Gradient Methods [56.060918447252625]
We present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees.
We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.
arXiv Detail & Related papers (2024-12-27T04:22:02Z) - AdamL: A fast adaptive gradient method incorporating loss function [1.6025685183216696]
We propose AdamL, a novel variant of Adam that takes loss function information into account to attain better results.
We show that AdamL achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief.
In the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants and does not require manual adjustment of the learning rate during the later stages of training.
arXiv Detail & Related papers (2023-12-23T16:32:29Z) - StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling [0.0]
We introduce StochGradAdam, a novel extension of the Adam algorithm, incorporating gradient sampling techniques.
StochGradAdam achieves comparable or superior performance to Adam, even when using fewer gradient updates per iteration.
The results suggest that this approach is particularly effective for large-scale models and datasets.
arXiv Detail & Related papers (2023-10-25T22:45:31Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
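For reference, the bounds above use the standard notion of an $\epsilon$-stationary point from non-convex stochastic optimization; the block below states the usual convention, which may differ from the paper's exact definition in constants or in the choice of norm.

```latex
% x is an epsilon-stationary point of a smooth non-convex objective f if
\mathbb{E}\bigl[\|\nabla f(x)\|\bigr] \le \epsilon .
% "Gradient complexity" counts the stochastic gradient evaluations needed
% to find such a point, i.e. O(\epsilon^{-4}) for Adam in this result.
```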
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is simple and generic, and it can be leveraged to establish convergence for a broader family of non-convex optimization problems, such as compositional optimization.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, converges to the same solution.
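A one-line reason this holds, under the usual reading of weight decay as an explicit $\ell_2$ penalty (an assumption here, not a claim about the paper's exact setup): adding the penalty makes a convex objective strongly convex, so it has a unique minimizer that every convergent algorithm must reach.

```latex
\min_{x}\; L_\lambda(x) = f(x) + \frac{\lambda}{2}\,\|x\|_2^2 , \qquad \lambda > 0 .
% If f is convex, L_\lambda is \lambda-strongly convex and hence has a unique
% global minimizer x^\star; any algorithm that converges to a minimizer of
% L_\lambda therefore converges to the same x^\star.
```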
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective [0.0]
We propose a new, fast optimizer, Generalized AdaGrad (G-AdaGrad), for solving non-convex machine learning problems.
Specifically, we adopt a state-space perspective for analyzing two convergence-accelerating algorithms, namely G-AdaGrad and Adam.
arXiv Detail & Related papers (2021-05-31T20:30:25Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for non-convex non-concave min-max problems.
Our experiments show that adaptive gradient algorithms empirically outperform their non-adaptive counterparts in GAN training.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.