Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
- URL: http://arxiv.org/abs/2404.04454v1
- Date: Fri, 5 Apr 2024 23:56:50 GMT
- Title: Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
- Authors: Shuo Xie, Zhiyuan Li
- Abstract summary: Adam with weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks.
We make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization.
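We show that, in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor.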
- Score: 5.896194021915813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with $\ell_2$ regularization optimizes the $\ell_2$ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to $\ell_\infty$ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
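The connection described in the abstract can be illustrated with a small sketch. It is not the paper's experiment or the full AdamW algorithm: it uses plain SignGD (the limit of Adam with $\beta_1 = \beta_2 = 0$ and $\epsilon = 0$) with decoupled weight decay $\lambda$ and a constant learning rate $\eta$. The update $x_{t+1} = (1 - \eta\lambda)x_t - \eta\,\mathrm{sign}(g_t)$ can be rewritten as $(1 - \eta\lambda)x_t + \eta\lambda \cdot (-\tfrac{1}{\lambda}\mathrm{sign}(g_t))$, a convex combination of $x_t$ and a vertex of the $\ell_\infty$ ball of radius $1/\lambda$, which has the form of a Frank-Wolfe step. The function name `sign_descent_with_decay`, the quadratic toy loss, and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def sign_descent_with_decay(grad_fn, x0, lr=0.01, weight_decay=0.5, steps=2000):
    """SignGD with decoupled weight decay (a simplified stand-in for AdamW).

    Update: x <- (1 - lr * weight_decay) * x - lr * sign(grad(x)).
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = grad_fn(x)
        x = (1.0 - lr * weight_decay) * x - lr * np.sign(g)
    return x

# Hypothetical toy loss 0.5 * ||x - target||^2; its unconstrained minimizer
# is `target`, which lies outside the l_inf ball of radius 1 / weight_decay.
target = np.array([5.0, -0.5, 3.0])

def grad_fn(x):
    return x - target

weight_decay = 0.5
radius = 1.0 / weight_decay  # l_inf bound predicted by the constrained view

x_final = sign_descent_with_decay(
    grad_fn, np.zeros(3), lr=0.01, weight_decay=weight_decay
)

print("final iterate:", x_final)
print("l_inf norm:", np.max(np.abs(x_final)), "bound:", radius)
```

In this toy problem the coordinates whose unconstrained optimum lies outside the ball settle near the boundary value $1/\lambda$, while the interior coordinate hovers near its unconstrained optimum, which matches the box-constrained minimizer of the quadratic. With a constant learning rate the interior coordinate keeps oscillating within a band proportional to $\eta$; a decaying schedule would shrink that band.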
Related papers
- Convergence Rate Analysis of LION [54.28350823319057]
We show that LION converges to a Karush-Kuhn-Tucker (KKT) point at a rate that depends on $\sqrt{d}$ and the number of iterations $K$, measured by a gradient norm.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected.
arXiv Detail & Related papers (2024-10-10T17:58:53Z) - Decoupled Weight Decay for Any $p$ Norm [1.1510009152620668]
We consider a simple yet effective approach to sparsification based on Bridge, or $L_p$, regularization during training.
We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm.
We empirically demonstrate that it leads to highly sparse networks, while maintaining performance comparable to standard $L_2$ regularization.
arXiv Detail & Related papers (2024-04-16T18:02:15Z) - Closing the Gap Between the Upper Bound and the Lower Bound of Adam's Iteration Complexity [51.96093077151991]
We derive a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption.
Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate.
arXiv Detail & Related papers (2023-10-27T09:16:58Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam is typically used together with a squared $\ell_2$ regularizer, a combination referred to as Adam-$\ell_2$.
AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$.
We show that the advantage of AdamW over Adam-$\ell_2$ correlates with the degree to which we expect the gradients of the network to exhibit multiple scales.
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges with the same rate $O(d\ln(N)/\sqrt{N})$ when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.