Understanding AdamW through Proximal Methods and Scale-Freeness
- URL: http://arxiv.org/abs/2202.00089v1
- Date: Mon, 31 Jan 2022 21:00:55 GMT
- Title: Understanding AdamW through Proximal Methods and Scale-Freeness
- Authors: Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona
- Abstract summary: Adam is typically paired with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$).
AdamW instead decouples the gradient of the regularizer from the Adam-$\ell_2$ update rule.
We show a correlation between the problems on which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
- Score: 57.47324825501137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam has been widely adopted for training deep neural networks due to less
hyperparameter tuning and remarkable performance. To improve generalization,
Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred
to as Adam-$\ell_2$). However, even better performance can be obtained with
AdamW, which decouples the gradient of the regularizer from the update rule of
Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the
advantages of AdamW. In this paper, we tackle this question from both an
optimization and an empirical point of view. First, we show how to re-interpret
AdamW as an approximation of a proximal gradient method, which takes advantage
of the closed-form proximal mapping of the regularizer instead of only
utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the
property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart:
their updates are invariant to component-wise rescaling of the gradients. We
provide empirical evidence across a wide range of deep learning experiments
showing a correlation between the problems in which AdamW exhibits an advantage
over Adam-$\ell_2$ and the degree to which we expect the gradients of the
network to exhibit multiple scales, thus motivating the hypothesis that the
advantage of AdamW could be due to the scale-free updates.
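To make the contrast concrete, here is a minimal NumPy sketch (not the paper's code; hyperparameter values are illustrative defaults) of one step of Adam-$\ell_2$, which folds the regularizer's gradient $\lambda w$ into the stochastic gradient before the adaptive normalization, next to one step of AdamW, whose decoupled decay $w \leftarrow w - \eta\lambda w$ can be read as a first-order approximation of the closed-form proximal map $w \leftarrow w/(1+\eta\lambda)$ of the squared $\ell_2$ regularizer. The last lines check scale-freeness numerically: with $\epsilon = 0$, rescaling each gradient coordinate by a fixed constant leaves the AdamW trajectory unchanged, while the Adam-$\ell_2$ trajectory changes because $\lambda w$ mixes with the rescaled gradients.

```python
# Minimal sketch (assumed hyperparameters, not the authors' code) contrasting
# Adam with a squared-l2 penalty against AdamW, plus a numerical check of
# the scale-freeness property discussed in the abstract.
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1e-2):
    """Adam-l2: the regularizer's gradient lam*w is folded into the stochastic
    gradient, so it passes through the adaptive normalization."""
    g = g + lam * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1e-2):
    """AdamW: the decay acts on the parameters outside the normalization;
    w - lr*lam*w approximates the exact proximal step w / (1 + lr*lam)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * w), m, v

def run(step_fn, grads, eps):
    w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    for t, g in enumerate(grads, start=1):
        w, m, v = step_fn(w, g, m, v, t, eps=eps)
    return w

rng = np.random.default_rng(0)
grads = [rng.normal(size=3) for _ in range(50)]
scale = np.array([1e-3, 1.0, 1e3])  # component-wise rescaling of the gradients

# With eps -> 0 the AdamW trajectory is invariant to the rescaling ...
print(np.allclose(run(adamw_step, grads, 0.0),
                  run(adamw_step, [g * scale for g in grads], 0.0)))   # True
# ... while Adam-l2 is not, because lam*w mixes with the rescaled gradients.
print(np.allclose(run(adam_l2_step, grads, 0.0),
                  run(adam_l2_step, [g * scale for g in grads], 0.0))) # False
```

The decoupled form above, in which the learning rate multiplies both the adaptive step and the decay term, matches common AdamW implementations.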
Related papers
- Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization [5.896194021915813]
Adam with weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks.
We make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization.
We show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss.
arXiv Detail & Related papers (2024-04-05T23:56:50Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Maximizing Communication Efficiency for Large-scale Training via 0/1
Adam [49.426602335460295]
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
arXiv Detail & Related papers (2022-02-12T08:02:23Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that when the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, converges to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Investigating Alternatives to the Root Mean Square for Adaptive Gradient
Methods [20.531576904743282]
Adam is an adaptive gradient method that has experienced widespread adoption due to its fast and reliable training performance.
Recent approaches have not offered significant improvement over Adam, often because they do not innovate upon one of its core features: normalization by the root mean square (RMS) of recent gradients.
We theoretically and empirically characterize the influence of different $L_p$ norms on adaptive gradient methods for the first time.
arXiv Detail & Related papers (2021-06-10T01:38:37Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - EAdam Optimizer: How $\epsilon$ Impact Adam [7.0552555621312605]
We discuss the impact of the constant $\epsilon$ for Adam in this paper.
Based on this finding, we propose a new variant of Adam called EAdam.
Our method can bring significant improvements compared with Adam; a sketch of where $\epsilon$ enters the Adam update follows this list.
arXiv Detail & Related papers (2020-11-04T06:39:44Z)
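Several entries above, EAdam in particular, hinge on where the constant $\epsilon$ enters Adam's second-moment normalization. The sketch below is a hedged illustration rather than code from any of the listed papers: the standard Adam placement adds $\epsilon$ once, outside the root of the bias-corrected second moment, while the EAdam-style variant shown here accumulates $\epsilon$ into the second-moment estimate at every step; this placement is an assumption based on the abstract, so the exact rule should be taken from the EAdam paper.

```python
# Illustrative sketch (assumptions, not code from the listed papers) of two
# placements of the constant eps in Adam's denominator. The "EAdam-style"
# placement below, accumulating eps into v at every step, is an assumption
# based on the abstract; consult the EAdam paper for the exact update.
import numpy as np

def adam_denominator(grads, beta2=0.999, eps=1e-8):
    """Standard Adam: eps is added once, outside the bias-corrected root."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** len(grads))
    return np.sqrt(v_hat) + eps

def eadam_style_denominator(grads, beta2=0.999, eps=1e-8):
    """Assumed EAdam-style placement: eps is folded into v at every step,
    so its effective contribution is amplified by the accumulation."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2 + eps
    v_hat = v / (1 - beta2 ** len(grads))
    return np.sqrt(v_hat)

grads = [np.full(2, 1e-4) for _ in range(1000)]  # small, nearly flat gradients
print(adam_denominator(grads))        # about 1e-4: eps is negligible here
print(eadam_style_denominator(grads)) # about 3e-3: the accumulated eps dominates
```

With small gradients the two placements produce denominators that differ by more than an order of magnitude, illustrating why the choice and placement of $\epsilon$ can matter.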
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.