Related papers: Adam with model exponential moving average is effective for nonconvex optimization

URL: http://arxiv.org/abs/2405.18199v2
Date: Wed, 30 Oct 2024 17:51:28 GMT
Title: Adam with model exponential moving average is effective for nonconvex optimization
Authors: Kwangjun Ahn, Ashok Cutkosky
Abstract summary: We offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms such as Adam, and (ii) the model exponential moving average (EMA).
Score: 45.242009309234305
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when the scale varies significantly across different coordinates, we demonstrate that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements -- momentum and discounting factors -- as well as model EMA, motivating their wide applications in practice.
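As a rough illustration of the ingredients the abstract names (momentum, discounting, coordinate-wise scaling, clipping, and model EMA), the following is a minimal sketch and not the paper's algorithm. The function name, the hyperparameter defaults, the test objective, and in particular the clipping rule (a per-coordinate clamp on the normalized step) are assumptions made for this example; the paper's exact clipped variant is not specified in the abstract.

```python
import math


def clipped_adam_with_ema(grad_fn, x0, steps=500, lr=0.05, beta1=0.9,
                          beta2=0.999, eps=1e-8, clip=1.0, ema_decay=0.99):
    """Adam-style updates plus an exponential moving average of the iterates.

    All names and defaults are hypothetical. Returns (last iterate, EMA iterate).
    """
    x = list(x0)             # current model parameters
    ema = list(x0)           # model EMA, initialized at the starting point
    m = [0.0] * len(x)       # first-moment (momentum) estimate
    v = [0.0] * len(x)       # second-moment estimate (discounted squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            step = m_hat / (math.sqrt(v_hat) + eps)
            # Assumed clipping rule: clamp each coordinate's normalized step.
            step = max(-clip, min(clip, step))
            x[i] -= lr * step
            # Model EMA: average the parameters, not the gradients.
            ema[i] = ema_decay * ema[i] + (1.0 - ema_decay) * x[i]
    return x, ema


def badly_scaled_grad(x):
    """Gradient of f(x) = 0.5 * (100 * x[0]**2 + 0.01 * x[1]**2)."""
    return [100.0 * x[0], 0.01 * x[1]]


if __name__ == "__main__":
    last, averaged = clipped_adam_with_ema(badly_scaled_grad, [1.0, 1.0])
    print("last iterate:", last)
    print("EMA iterate: ", averaged)
```

The toy objective has curvatures that differ by a factor of 10^4 across its two coordinates. Because each coordinate's step is divided by its own gradient scale, both coordinates make progress at a similar pace under a single learning rate, which is the kind of coordinate-wise adaptivity the abstract refers to.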

Related papers

Divergence Minimization Preference Optimization for Diffusion Model Alignment [66.31417479052774]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. DMPO can consistently outperform or match existing techniques across different base models and test sets.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning [5.052293146674794]
Averaging techniques such as Ruppert--Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate stochastic gradient descent (SGD) optimization methods such as the popular ADAM. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute different averaged variants of ADAM in parallel and, during the training process, dynamically select the variant with the smallest optimization error.
arXiv Detail & Related papers (2025-05-28T08:07:34Z)
MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to the development of this type of training. We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations [0.0]
We propose a continuous-time formulation for the AdaGrad, RMSProp, and Adam optimization algorithms. We perform numerical simulations of these equations to demonstrate their validity as accurate approximations of the original algorithms.
arXiv Detail & Related papers (2024-11-14T19:00:01Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling. Our findings indicate that, except for SGD, these algorithms all perform comparably, both in their optimal performance and in how they fare across a wide range of hyperparameter choices.
arXiv Detail & Related papers (2024-07-10T18:11:40Z)
Variational Stochastic Gradient Descent for Deep Neural Networks [16.96187187108041]
The current state of the art is adaptive gradient-based optimization methods such as Adam. Here, we propose to combine both approaches, resulting in Variational Stochastic Gradient Descent (VSGD). We show how our VSGD method relates to other adaptive gradient-based optimizers like Adam.
arXiv Detail & Related papers (2024-04-09T18:02:01Z)
Delegating Data Collection in Decentralized Machine Learning [67.0537668772372]
Motivated by the emergence of decentralized machine learning (ML) ecosystems, we study the delegation of data collection. We design optimal and near-optimal contracts that deal with two fundamental information asymmetries. We show that a principal can cope with such asymmetry via simple linear contracts that achieve a 1-1/e fraction of the optimal utility.
arXiv Detail & Related papers (2023-09-04T22:16:35Z)
Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE to Adaptive and non-adaptive momentum) framework. We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z)
Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems. We propose a novel stochastic bilevel optimizer named stocBiO with a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z)
MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods [30.084554989542475]
We present a new framework for Adam-type methods that uses trend information when updating the parameters with the adaptive step size and gradients. We show empirically the importance of adding the trend component: our framework consistently outperforms the conventional Adam and AMSGrad methods on classical models with several real-world datasets.
arXiv Detail & Related papers (2020-01-17T01:23:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.