Related papers: The Rich and the Simple: On the Implicit Bias of Adam and SGD

The Rich and the Simple: On the Implicit Bias of Adam and SGD

URL: http://arxiv.org/abs/2505.24022v1
Date: Thu, 29 May 2025 21:46:12 GMT
Title: The Rich and the Simple: On the Implicit Bias of Adam and SGD
Authors: Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, Mahdi Soltanolkotabi,
Abstract summary: Adam is the de facto optimization algorithm for several deep learning applications.<n>In practice, neural networks trained with (stochastic) descent gradient (GD) are known to exhibit simplicity bias.<n>We show that Adam is more resistant to such simplicity bias.
Score: 22.211512632184398
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adam is the de facto optimization algorithm for several deep learning applications, but an understanding of its implicit bias and how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks trained with SGD are known to exhibit simplicity bias -- a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. To demystify this phenomenon, in this paper, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU neural networks on a binary classification task involving synthetic data with Gaussian clusters. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam leads to much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes' optimal predictor. This richer decision boundary also allows Adam to achieve higher test accuracy both in-distribution and under certain distribution shifts. We theoretically prove these results by analyzing the population gradients. To corroborate our theoretical findings, we present empirical results showing that this property of Adam leads to superior generalization across datasets with spurious correlations where neural networks trained with SGD are known to show simplicity bias and don't generalize well under certain distributional shifts.

Related papers

Graph Out-of-Distribution Generalization via Causal Intervention [69.70137479660113]
We introduce a conceptually simple yet principled approach for training robust graph neural networks (GNNs) under node-level distribution shifts. Our method resorts to a new learning objective derived from causal inference that coordinates an environment estimator and a mixture-of-expert GNN predictor. Our model can effectively enhance generalization with various types of distribution shifts and yield up to 27.4% accuracy improvement over state-of-the-arts on graph OOD generalization benchmarks.
arXiv Detail & Related papers (2024-02-18T07:49:22Z)
Alleviating Structural Distribution Shift in Graph Anomaly Detection [70.1022676681496]
Graph anomaly detection (GAD) is a challenging binary classification problem. Gallon neural networks (GNNs) benefit the classification of normals from aggregating homophilous neighbors. We propose a framework to mitigate the effect of heterophilous neighbors and make them invariant.
arXiv Detail & Related papers (2024-01-25T13:07:34Z)
AdamL: A fast adaptive gradient method incorporating loss function [1.6025685183216696]
We propose AdamL, a novel variant of the Adam, that takes into account the loss function information to attain better results. We show that AdamL achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief. In the case of vanilla convolutional neural networks, AdamL stands out from the other Adam's variants and does not require the manual adjustment of the learning rate during the later stage of the training.
arXiv Detail & Related papers (2023-12-23T16:32:29Z)
When Neural Networks Fail to Generalize? A Model Sensitivity Perspective [82.36758565781153]
Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions. This paper considers a more realistic yet more challenging scenario, namely Single Domain Generalization (Single-DG) We empirically ascertain a property of a model that correlates strongly with its generalization that we coin as "model sensitivity" We propose a novel strategy of Spectral Adversarial Data Augmentation (SADA) to generate augmented images targeted at the highly sensitive frequencies.
arXiv Detail & Related papers (2022-12-01T20:15:15Z)
Discovering Invariant Rationales for Graph Neural Networks [104.61908788639052]
Intrinsic interpretability of graph neural networks (GNNs) is to find a small subset of the input graph's features. We propose a new strategy of discovering invariant rationale (DIR) to construct intrinsically interpretable GNNs.
arXiv Detail & Related papers (2022-01-30T16:43:40Z)
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay globalization. We show that if convex, and the weight decay regularization is employed, any optimization algorithms including Adam will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks. Existing approaches, such as decreasing an adaptive learning rate, adopting a big batch size, have tried to promote Adam-type algorithms to converge. We introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of historical base learning rate.
arXiv Detail & Related papers (2021-01-14T06:42:29Z)
Fast Learning of Graph Neural Networks with Guaranteed Generalizability: One-hidden-layer Case [93.37576644429578]
Graph neural networks (GNNs) have made great progress recently on learning from graph-structured data in practice. We provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems.
arXiv Detail & Related papers (2020-06-25T00:45:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.