Investigating Alternatives to the Root Mean Square for Adaptive Gradient
Methods
- URL: http://arxiv.org/abs/2106.05449v1
- Date: Thu, 10 Jun 2021 01:38:37 GMT
- Title: Investigating Alternatives to the Root Mean Square for Adaptive Gradient
Methods
- Authors: Brett Daley and Christopher Amato
- Abstract summary: Adam is an adaptive gradient method that has experienced widespread adoption due to its fast and reliable training performance.
Recent approaches have not offered significant improvement over Adam, often because they do not innovate upon one of its core features: normalization by the root mean square (RMS) of recent gradients.
We theoretically and empirically characterize the influence of different $L^p$ norms on adaptive gradient methods for the first time.
- Score: 20.531576904743282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam is an adaptive gradient method that has experienced widespread adoption
due to its fast and reliable training performance. Recent approaches have not
offered significant improvement over Adam, often because they do not innovate
upon one of its core features: normalization by the root mean square (RMS) of
recent gradients. However, as noted by Kingma and Ba (2015), any number of
$L^p$ normalizations are possible, with the RMS corresponding to the specific
case of $p=2$. In our work, we theoretically and empirically characterize the
influence of different $L^p$ norms on adaptive gradient methods for the first
time. We show mathematically how the choice of $p$ influences the size of the
steps taken, while leaving other desirable properties unaffected. We evaluate
Adam with various $L^p$ norms on a suite of deep learning benchmarks, and find
that $p > 2$ consistently leads to improved learning speed and final
performance. The choices of $p=3$ or $p=6$ also match or outperform
state-of-the-art methods in all of our experiments.
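To make the generalization concrete, here is a minimal NumPy sketch of an Adam-style step in which the usual squared-gradient accumulator is replaced by a $p$-th power accumulator and the RMS denominator by its $p$-th root, so $p=2$ recovers standard Adam. The hyperparameter values and the toy objective are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adam_lp_step(theta, grad, m, v, t, p=3, lr=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step normalized by the L^p norm of recent
    gradients (p=2 recovers the usual RMS denominator)."""
    m = beta1 * m + (1 - beta1) * grad              # first moment (momentum)
    v = beta2 * v + (1 - beta2) * np.abs(grad)**p   # p-th moment of |g|
    m_hat = m / (1 - beta1**t)                      # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (v_hat**(1.0 / p) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = ||x||^2 with p = 3.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_lp_step(theta, grad, m, v, t, p=3)
print(theta)  # parameters end up near the optimum at the origin
```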
Related papers
- ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate [21.378608502899077]
We propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ with any $\beta_2$, without depending on the bounded noise assumption.
Our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning.
arXiv Detail & Related papers (2024-11-05T06:57:47Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient-based adaption.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam-$\ell_2$ is Adam applied to an objective augmented with an $\ell_2$ regularizer, whereas AdamW decouples the weight decay from Adam's update rule.
We show a correspondence between the advantage of AdamW over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
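The decoupling is easiest to see side by side. Below is a minimal NumPy sketch of the two updates; the hyperparameter names and values are illustrative assumptions, and the bias-corrected moments follow the standard Adam recursion.

```python
import numpy as np

def adam_moments(grad, m, v, t, beta1=0.9, beta2=0.999):
    """Standard Adam moment accumulators with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    return m, v, m / (1 - beta1**t), v / (1 - beta2**t)

def adam_l2_step(w, grad, m, v, t, lr=1e-3, lam=1e-2, eps=1e-8):
    """Adam-l2: the l2 penalty enters through the gradient, so it is
    rescaled by the adaptive denominator like every other term."""
    m, v, m_hat, v_hat = adam_moments(grad + lam * w, m, v, t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, lam=1e-2, eps=1e-8):
    """AdamW: weight decay is applied directly to the weights,
    decoupled from the adaptive normalization."""
    m, v, m_hat, v_hat = adam_moments(grad, m, v, t)
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * w), m, v
```

The only difference is where $\lambda w$ enters: inside the adaptive normalization (Adam-$\ell_2$) or outside it (AdamW), which is the decoupling the paper analyzes.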
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning [16.824515577815696]
Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures.
We show that existing meta-gradient estimators adopted by GMRL are actually biased.
We conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods such as off-policy learning and low-bias estimators can help fix the gradient bias for GMRL algorithms in general.
arXiv Detail & Related papers (2021-12-31T11:56:40Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate, on two popular deep learning tasks, the empirical advantages of our methods over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - A new regret analysis for Adam-type algorithms [78.825194932103]
In theory, regret guarantees for online convex optimization require a rapidly decaying $\beta_1 \to 0$ schedule.
We propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $\beta_1$.
arXiv Detail & Related papers (2020-03-21T19:19:51Z) - The Geometry of Sign Gradient Descent [29.8753797565422]
We show a close connection between separable smoothness and $\ell_\infty$-smoothness and argue that the latter is the weaker and more natural assumption.
We then proceed to study the smoothness constant with respect to the $\ell_\infty$-norm and thereby isolate geometric properties of the objective function.
In short, we find sign-based methods to be preferable to gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue.
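For reference, sign gradient descent itself is only a few lines. The sketch below is a generic illustration (the step size, iteration count, and toy quadratic are assumptions for demonstration), not the paper's experimental setup.

```python
import numpy as np

def sign_gd(grad_fn, x0, lr=0.01, steps=500):
    """Minimal sign gradient descent: each step moves by a fixed amount
    in the sign of each gradient coordinate, ignoring its magnitude."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * np.sign(grad_fn(x))
    return x

def grad(x):
    # Gradient of f(x) = 100*x0^2 + x1^2: a diagonal Hessian whose largest
    # eigenvalue is much larger than the average -- the regime where the
    # analysis above favors sign-based methods over plain gradient descent.
    return np.array([200.0 * x[0], 2.0 * x[1]])

print(sign_gd(grad, [1.0, 1.0]))  # both coordinates end up within ~lr of 0
```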
arXiv Detail & Related papers (2020-02-19T08:45:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.