Investigating Alternatives to the Root Mean Square for Adaptive Gradient
Methods
- URL: http://arxiv.org/abs/2106.05449v1
- Date: Thu, 10 Jun 2021 01:38:37 GMT
- Title: Investigating Alternatives to the Root Mean Square for Adaptive Gradient
Methods
- Authors: Brett Daley and Christopher Amato
- Abstract summary: Adam is an adaptive gradient method that has experienced widespread adoption due to its fast and reliable training performance.
Recent approaches have not offered significant improvement over Adam, often because they do not innovate upon one of its core features: normalization by the root mean square (RMS) of recent gradients.
We theoretically and empirically characterize the influence of different $L^p$ norms on adaptive gradient methods for the first time.
- Score: 20.531576904743282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam is an adaptive gradient method that has experienced widespread adoption
due to its fast and reliable training performance. Recent approaches have not
offered significant improvement over Adam, often because they do not innovate
upon one of its core features: normalization by the root mean square (RMS) of
recent gradients. However, as noted by Kingma and Ba (2015), any number of
$L^p$ normalizations are possible, with the RMS corresponding to the specific
case of $p=2$. In our work, we theoretically and empirically characterize the
influence of different $L^p$ norms on adaptive gradient methods for the first
time. We show mathematically how the choice of $p$ influences the size of the
steps taken, while leaving other desirable properties unaffected. We evaluate
Adam with various $L^p$ norms on a suite of deep learning benchmarks, and find
that $p > 2$ consistently leads to improved learning speed and final
performance. The choices of $p=3$ or $p=6$ also match or outperform
state-of-the-art methods in all of our experiments.
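To make the generalization concrete, here is a minimal NumPy sketch of an Adam-style step in which the usual squared-gradient accumulator is replaced by a $p$-th power accumulator and the RMS denominator by its $p$-th root, so $p=2$ recovers standard Adam. The hyperparameter values and the toy objective are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adam_lp_step(theta, grad, m, v, t, p=3, lr=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step normalized by the L^p norm of recent
    gradients (p=2 recovers the usual RMS denominator)."""
    m = beta1 * m + (1 - beta1) * grad              # first moment (momentum)
    v = beta2 * v + (1 - beta2) * np.abs(grad)**p   # p-th moment of |g|
    m_hat = m / (1 - beta1**t)                      # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (v_hat**(1.0 / p) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = ||x||^2 with p = 3.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_lp_step(theta, grad, m, v, t, p=3)
print(theta)  # parameters end up near the optimum at the origin
```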
Related papers
- ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate [21.378608502899077]
We propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ with any $\beta_2$, without depending on the bounded noise assumption.
Our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning.
arXiv Detail & Related papers (2024-11-05T06:57:47Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient-based adaption.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam-$\ell_2$ is Adam applied to an objective augmented with an $\ell_2$ regularizer, whereas AdamW decouples the weight decay from Adam's update rule.
We show a correspondence between the advantage of AdamW over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
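The decoupling is easiest to see side by side. Below is a minimal NumPy sketch of the two updates; the hyperparameter names and values are illustrative assumptions, and the bias-corrected moments follow the standard Adam recursion.

```python
import numpy as np

def adam_moments(grad, m, v, t, beta1=0.9, beta2=0.999):
    """Standard Adam moment accumulators with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    return m, v, m / (1 - beta1**t), v / (1 - beta2**t)

def adam_l2_step(w, grad, m, v, t, lr=1e-3, lam=1e-2, eps=1e-8):
    """Adam-l2: the l2 penalty enters through the gradient, so it is
    rescaled by the adaptive denominator like every other term."""
    m, v, m_hat, v_hat = adam_moments(grad + lam * w, m, v, t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, lam=1e-2, eps=1e-8):
    """AdamW: weight decay is applied directly to the weights,
    decoupled from the adaptive normalization."""
    m, v, m_hat, v_hat = adam_moments(grad, m, v, t)
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * w), m, v
```

The only difference is where $\lambda w$ enters: inside the adaptive normalization (Adam-$\ell_2$) or outside it (AdamW), which is the decoupling the paper analyzes.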
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning [16.824515577815696]
Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures.
We show that existing meta-gradient estimators adopted by GMRL are actually biased.
We conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods such as off-policy learning and low-bias estimators can help fix the gradient bias for GMRL algorithms in general.
arXiv Detail & Related papers (2021-12-31T11:56:40Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate, on two popular deep learning tasks, the empirical advantages of our methods over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - A new regret analysis for Adam-type algorithms [78.825194932103]
In theory, regret guarantees for online convex optimization require a rapidly decaying $\beta_1 \to 0$ schedule.
We propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $\beta_1$.
arXiv Detail & Related papers (2020-03-21T19:19:51Z) - The Geometry of Sign Gradient Descent [29.8753797565422]
We show a close connection between separable smoothness and $\ell_\infty$-smoothness and argue that the latter is the weaker and more natural assumption.
We then proceed to study the smoothness constant with respect to the $\ell_\infty$-norm and thereby isolate geometric properties of the objective function.
In short, we find sign-based methods to be preferable to gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue.
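For reference, sign gradient descent itself is only a few lines. The sketch below is a generic illustration (the step size, iteration count, and toy quadratic are assumptions for demonstration), not the paper's experimental setup.

```python
import numpy as np

def sign_gd(grad_fn, x0, lr=0.01, steps=500):
    """Minimal sign gradient descent: each step moves by a fixed amount
    in the sign of each gradient coordinate, ignoring its magnitude."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * np.sign(grad_fn(x))
    return x

def grad(x):
    # Gradient of f(x) = 100*x0^2 + x1^2: a diagonal Hessian whose largest
    # eigenvalue is much larger than the average -- the regime where the
    # analysis above favors sign-based methods over plain gradient descent.
    return np.array([200.0 * x[0], 2.0 * x[1]])

print(sign_gd(grad, [1.0, 1.0]))  # both coordinates end up within ~lr of 0
```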
arXiv Detail & Related papers (2020-02-19T08:45:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.