Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
- URL: http://arxiv.org/abs/2510.26303v2
- Date: Sat, 01 Nov 2025 03:55:48 GMT
- Title: Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
- Authors: Beomhan Baek, Minhak Song, Chulhee Yun
- Abstract summary: Adam is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. We study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data. We construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier.
- Score: 26.492222550365735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
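To make the setting concrete, here is a minimal, illustrative sketch (not taken from the paper) of the two updates the abstract contrasts: incremental Adam, which processes one sample per step, and Signum, run on logistic regression over a toy linearly separable dataset. The data, hyperparameters, and iteration counts are assumptions for illustration only; the paper's claims concern the limiting direction $w_t/\|w_t\|$, and on this symmetric toy example the $\ell_2$- and $\ell_\infty$-max-margin directions happen to coincide, whereas the paper constructs structured datasets on which they differ.

```python
# Illustrative sketch: incremental (per-sample) Adam vs. Signum on logistic
# regression over a small linearly separable dataset. Hyperparameters and
# data are assumptions for illustration, not the paper's experimental setup.
import numpy as np

# Linearly separable toy data: y_i * <x_i, w*> > 0 for w* = (1, 1).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def logistic_grad(w, x, yi):
    """Gradient of log(1 + exp(-yi * <w, x>)) w.r.t. w (numerically stable)."""
    margin = yi * (x @ w)
    return -yi * x * np.exp(-np.logaddexp(0.0, margin))

def incremental_adam(X, y, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, epochs=1000):
    """Adam run with one sample per step, cycling through the dataset."""
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    t = 0
    for _ in range(epochs):
        for i in range(len(y)):
            t += 1
            g = logistic_grad(w, X[i], y[i])
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)
            v_hat = v / (1 - beta2**t)
            w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

def signum(X, y, lr=1e-2, beta=0.999, epochs=1000):
    """Signum: step in the sign of the momentum buffer, one sample per step."""
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    for _ in range(epochs):
        for i in range(len(y)):
            g = logistic_grad(w, X[i], y[i])
            m = beta * m + (1 - beta) * g
            w -= lr * np.sign(m)
    return w

# Compare the (normalized) directions the two methods drift toward.
w_adam = incremental_adam(X, y)
w_signum = signum(X, y)
print("incremental Adam direction:", w_adam / np.linalg.norm(w_adam))
print("Signum direction:          ", w_signum / np.linalg.norm(w_signum))
```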
Related papers
- Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks [38.11287525994738]
We present the first theoretical characterization of how the batch size affects Adam's generalization. Our results reveal that while both Adam and AdamW with proper weight decay converge to solutions with poor test error, their mini-batch variants can achieve near-zero test error.
arXiv Detail & Related papers (2025-10-13T12:48:22Z) - Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $\mathcal{O}(\frac{1}{T^{1/4}})$ rather than the previous $\mathcal{O}(\frac{\ln T}{T^{1/4}})$. Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
arXiv Detail & Related papers (2025-07-08T13:19:26Z) - Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
We find that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected; our experiments confirm this.
arXiv Detail & Related papers (2024-10-10T17:58:53Z) - The Implicit Bias of Adam on Separable Data [27.451499849532176]
We show that when training data are linearly separable, Adam converges towards a linear classifier that achieves the maximum $\ell_\infty$-margin, given diminishing learning rates.
Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective.
arXiv Detail & Related papers (2024-06-15T14:39:37Z) - Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization [5.896194021915813]
Adam with weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks.
We make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization.
We show that, in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under an $\ell_\infty$-norm constraint.
arXiv Detail & Related papers (2024-04-05T23:56:50Z) - Closing the Gap Between the Upper Bound and the Lower Bound of Adam's
Iteration Complexity [51.96093077151991]
We derive a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption.
Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate.
arXiv Detail & Related papers (2023-10-27T09:16:58Z) - UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic
Optimization [20.399244578926474]
We introduce a unified framework for Adam-type algorithms, called UAdam.
It is equipped with a general form of the second-order moment, covering variants such as NAdam, AdaBound, AdaFom, and Adan.
We show that UAdam converges to the neighborhood of stationary points with the rate of $\mathcal{O}(1/T)$.
arXiv Detail & Related papers (2023-05-09T13:07:03Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
We compare AdamW with Adam combined with an $\ell_2$ regularizer (Adam-$\ell_2$).
AdamW decouples the weight decay from the gradient-based update of Adam-$\ell_2$.
We show that the advantage of AdamW over Adam-$\ell_2$ correlates with the degree to which we expect the gradients of the network to exhibit multiple scales.
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both Adam and Adagrad, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)