Related papers: The Implicit Bias of Adam on Separable Data

URL: http://arxiv.org/abs/2406.10650v1
Date: Sat, 15 Jun 2024 14:39:37 GMT
Title: The Implicit Bias of Adam on Separable Data
Authors: Chenyang Zhang, Difan Zou, Yuan Cao
Abstract summary: We show that when the training data are linearly separable, Adam converges towards a linear classifier that achieves the maximum $\ell_\infty$-margin. Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective.
Score: 27.451499849532176
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adam has become one of the most favored optimizers in deep learning problems. Despite its success in practice, numerous mysteries persist regarding its theoretical understanding. In this paper, we study the implicit bias of Adam in linear logistic regression. Specifically, we show that when the training data are linearly separable, Adam converges towards a linear classifier that achieves the maximum $\ell_\infty$-margin. Notably, for a general class of diminishing learning rates, this convergence occurs within polynomial time. Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective.
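
The claim in the abstract can be illustrated with a small numerical sketch. It is not the authors' code or experimental setup: the two-point dataset, the step count, the $1/\sqrt{t}$ learning-rate schedule, and all function names are assumptions chosen for illustration, and the Adam update used is the standard bias-corrected form, which may differ in detail from the variant analyzed in the paper. The sketch trains a bias-free linear classifier with the logistic loss and compares the normalized $\ell_\infty$-margin, $\min_i y_i \langle w, x_i \rangle / \|w\|_\infty$, reached by Adam and by plain gradient descent.

```python
import math

def sigmoid_neg(z):
    """Return 1 / (1 + exp(z)) without overflow for large |z|."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y <w, x>))."""
    n, d = len(X), len(w)
    g = [0.0] * d
    for x, label in zip(X, y):
        margin = label * sum(wj * xj for wj, xj in zip(w, x))
        coeff = -label * sigmoid_neg(margin) / n
        for j in range(d):
            g[j] += coeff * x[j]
    return g

def run_adam(X, y, steps=2000, lr=0.5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Full-batch Adam with bias correction and a lr / sqrt(t) schedule (assumed)."""
    d = len(X[0])
    w, m, v = [0.0] * d, [0.0] * d, [0.0] * d
    for t in range(1, steps + 1):
        g = logistic_grad(w, X, y)
        step = lr / math.sqrt(t)
        for j in range(d):
            m[j] = beta1 * m[j] + (1 - beta1) * g[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g[j] * g[j]
            m_hat = m[j] / (1 - beta1 ** t)
            v_hat = v[j] / (1 - beta2 ** t)
            w[j] -= step * m_hat / (math.sqrt(v_hat) + eps)
    return w

def run_gd(X, y, steps=2000, lr=0.5):
    """Full-batch gradient descent with a constant step size (assumed)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = logistic_grad(w, X, y)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

def linf_margin(w, X, y):
    """Smallest signed score after rescaling w to unit l_infinity norm."""
    scale = max(abs(wj) for wj in w)
    scores = [label * sum(wj * xj for wj, xj in zip(w, x)) for x, label in zip(X, y)]
    return min(scores) / scale

# Hypothetical toy data: two separable points whose features have unequal scale.
X = [(1.0, 0.2), (-1.0, -0.2)]
y = [1, -1]

adam_margin = linf_margin(run_adam(X, y), X, y)
gd_margin = linf_margin(run_gd(X, y), X, y)
print("Adam normalized l_inf margin:", adam_margin)
print("GD   normalized l_inf margin:", gd_margin)
```

On this toy set the largest achievable normalized $\ell_\infty$-margin is 1.2, attained by weights proportional to $(1, 1)$. Adam's per-coordinate rescaling moves both weights at nearly the same rate, so its margin comes out close to 1.2, whereas gradient descent keeps the weights proportional to $(1, 0.2)$ and reaches only 1.04. A dataset this small only illustrates the direction of the effect; it says nothing about the convergence rates established in the paper.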

Related papers

The Rich and the Simple: On the Implicit Bias of Adam and SGD [22.211512632184398]
Adam is the de facto optimization algorithm for several deep learning applications. In practice, neural networks trained with (stochastic) gradient descent (GD) are known to exhibit simplicity bias. We show that Adam is more resistant to such simplicity bias.
arXiv Detail & Related papers (2025-05-29T21:46:12Z)
Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning. We show that Adam-Rel uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes. We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
arXiv Detail & Related papers (2024-12-22T18:01:08Z)
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models [23.520679217713685]
Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks.
arXiv Detail & Related papers (2024-02-29T18:47:52Z)
Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions. We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z)
Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence. Existing convergence analyses for Adam rely on the bounded smoothness assumption. This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z)
Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam is typically used together with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. We observe a correlation between the advantage of AdamW over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
arXiv Detail & Related papers (2022-01-31T21:00:55Z)
A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc. Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non- compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization. We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications. We propose a new method named Adam$^+$ (pronounced as Adam-plus). Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.