A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum
Acceleration
- URL: http://arxiv.org/abs/1808.03408v4
- Date: Mon, 15 May 2023 13:24:07 GMT
- Title: A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum
Acceleration
- Authors: Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun and Wei Liu
- Abstract summary: Integrating adaptive learning rate and momentum techniques
into SGD yields a large class of efficiently accelerated adaptive algorithms,
such as AdaGrad, RMSProp, Adam, and AccAdaGrad.
- Score: 21.929334023875874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating adaptive learning rate and momentum techniques into SGD leads to
a large class of efficiently accelerated adaptive stochastic algorithms, such
as AdaGrad, RMSProp, Adam, AccAdaGrad, \textit{etc}. In spite of their
effectiveness in practice, there is still a large gap in the theory of their
convergence, especially in the difficult non-convex stochastic setting. To
fill this gap, we propose \emph{weighted AdaGrad with unified momentum}, dubbed
AdaUSM, which has the main characteristics that (1) it incorporates a unified
momentum scheme which covers both the heavy ball momentum and the Nesterov
accelerated gradient momentum; (2) it adopts a novel weighted adaptive learning
rate that can unify the learning rates of AdaGrad, AccAdaGrad, Adam, and
RMSProp. Moreover, when we take polynomially growing weights in AdaUSM, we
obtain its $\mathcal{O}(\log(T)/\sqrt{T})$ convergence rate in the non-convex
stochastic setting. We also show that the adaptive learning rates of Adam and
RMSProp correspond to taking exponentially growing weights in AdaUSM, thereby
providing a new perspective for understanding Adam and RMSProp. Lastly,
comparative experiments of AdaUSM against SGD with momentum, AdaGrad, AdaEMA,
Adam, and AMSGrad on various deep learning models and datasets are also carried
out.
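As a rough illustration of the two ingredients the abstract names, polynomially growing weights and a momentum parameter that interpolates between heavy-ball and Nesterov, the sketch below is an assumption-laden reading of the abstract, not the paper's pseudocode; the function name `adausm_step`, the weight exponent `p`, and the interpolation parameter `nu` are all illustrative choices.

```python
import numpy as np

def adausm_step(x, m, v, w, grad_fn, t, eta=0.01, beta=0.9, nu=0.0, p=2):
    # Illustrative sketch of a weighted-AdaGrad step with unified momentum;
    # the exact update in the paper may differ.
    a = float(t) ** p                # polynomially growing weight (assumption)
    g = grad_fn(x + nu * beta * m)   # nu=0: heavy-ball; nu=1: Nesterov-style lookahead
    v = v + a * g ** 2               # weighted sum of squared gradients
    w = w + a                        # running total of the weights
    step = eta / (np.sqrt(v / w) + 1e-8)
    m = beta * m - step * g          # momentum buffer
    return x + m, m, v, w
```

On a toy quadratic $f(x) = x^2$, iterating this update from $x=5$ drives $x$ toward zero; once the weighted average of squared gradients stabilizes, the normalized step behaves much like sign-SGD with effective step size $\eta/(1-\beta)$.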
Related papers
- In Search of Adam's Secret Sauce [11.215133680044005]
We train over 1,300 language models across different data configurations and scales. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam. We show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients.
arXiv Detail & Related papers (2025-05-27T23:30:18Z)
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training [22.58304858379219]
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
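Reading that description literally, a minimal sketch might look like the following; the `(beta, 1 - beta)` weighting inside the denominator and the function name are assumptions, since the abstract does not state the exact weights.

```python
import numpy as np

def adams_step(x, g, m, lr=0.01, beta=0.9, eps=1e-8):
    # Sketch of an AdamS-style update as described in the abstract: no
    # second-moment buffer; the denominator mixes the momentum buffer and
    # the current gradient. The (beta, 1 - beta) weighting is an assumption.
    m = beta * m + (1 - beta) * g                         # momentum, as in SGD-M
    denom = np.sqrt(beta * m ** 2 + (1 - beta) * g ** 2) + eps
    return x - lr * m / denom, m
```

Because the denominator scales with the magnitude of recent gradients, the step size is self-normalizing, which is what lets the method drop the second-moment buffer while keeping SGD-with-momentum's memory footprint.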
arXiv Detail & Related papers (2025-05-22T08:16:48Z)
- Towards Simple and Provable Parameter-Free Adaptive Gradient Methods [56.060918447252625]
We present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees.
We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.
arXiv Detail & Related papers (2024-12-27T04:22:02Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T)/T^{1-\frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- A Multi-Grained Symmetric Differential Equation Model for Learning Protein-Ligand Binding Dynamics [73.35846234413611]
In drug discovery, molecular dynamics (MD) simulation provides a powerful tool for predicting binding affinities, estimating transport properties, and exploring pocket sites.
We propose NeuralMD, the first machine learning (ML) surrogate that can facilitate numerical MD and provide accurate simulations in protein-ligand binding dynamics.
We demonstrate the efficiency and effectiveness of NeuralMD, achieving over 1K$\times$ speedup compared to standard numerical MD simulations.
arXiv Detail & Related papers (2024-01-26T09:35:17Z)
- Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study the training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
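That scaling rule is easy to apply when retuning: hold $(1-\beta)\,\eta^{-2/3}$ fixed as the learning rate changes. A small helper along these lines (the function name and the reference values are arbitrary illustrative assumptions) could look like:

```python
def scaled_momentum(lr, lr_ref=0.1, beta_ref=0.9):
    # Keep (1 - beta) proportional to lr**(2/3), the scaling reported to
    # maximally accelerate training. The reference point (lr_ref, beta_ref)
    # is an arbitrary illustrative choice.
    one_minus_beta = (1.0 - beta_ref) * (lr / lr_ref) ** (2.0 / 3.0)
    return 1.0 - one_minus_beta
```

For example, dropping the learning rate from 0.1 to 0.0125 (a factor of 8) shrinks $1-\beta$ by a factor of $8^{2/3} = 4$, raising the momentum from 0.9 to 0.975.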
arXiv Detail & Related papers (2022-10-28T20:41:48Z)
- Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [134.83964935755964]
In deep learning, different kinds of deep networks typically need different extrapolations, which have to be chosen after multiple trials.
To relieve this issue and consistently improve the training speed of deep networks, we propose Adan, an adaptive Nesterov momentum algorithm.
arXiv Detail & Related papers (2022-08-13T16:04:39Z)
- Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
- Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization [20.782428252187024]
We propose a new adaptive momentum inspired by the optimal choice of the heavy-ball momentum for quadratic optimization.
Our proposed adaptive heavy-ball momentum can improve stochastic gradient descent (SGD) and Adam.
We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2021-10-18T07:03:48Z)
- Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective [0.0]
We propose a new fast optimizer, Generalized AdaGrad (G-AdaGrad), for solving non-convex machine learning problems.
Specifically, we adopt a state-space perspective for analyzing the convergence of the acceleration algorithms, namely G-AdaGrad and Adam.
arXiv Detail & Related papers (2021-05-31T20:30:25Z)
- Adam revisited: a weighted past gradients perspective [57.54752290924522]
We propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issues.
We prove that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD.
arXiv Detail & Related papers (2021-01-01T14:01:52Z)
- Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum [97.84312669132716]
We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
- Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with natural gradient methods and gradient clipping in their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.