A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias
- URL: http://arxiv.org/abs/2511.17378v1
- Date: Fri, 21 Nov 2025 16:41:14 GMT
- Title: A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias
- Authors: Wei-Kai Chang, Rajiv Khanna
- Abstract summary: Stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, but the mechanisms driving this generalization remain unclear. We develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two-layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.
- Score: 7.446140380340418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the dynamics of optimization in deep learning is increasingly important as models scale. While stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, the mechanisms driving this generalization remain unclear. Notably, these algorithms often prefer flatter or simpler minima, particularly in overparameterized settings. Prior work has linked flatness to generalization, and methods like Sharpness-Aware Minimization (SAM) explicitly encourage flatness, but a unified theory connecting data structure, optimization dynamics, and the nature of learned solutions is still lacking. In this work, we develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two-layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.
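To make the SGD/SAM contrast and the alignment idea concrete, here is a minimal sketch (not the authors' code; the linear model, step sizes, and the toy alignment statistic are our own assumptions) of one SGD step versus one SAM step, plus a per-example gradient alignment measure standing in for the paper's curvature-based coherence.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Per-example gradients of squared loss for a linear model (a toy
    stand-in for the two-layer ReLU networks analyzed in the paper)."""
    residuals = X @ w - y                    # shape (n,)
    return residuals[:, None] * X            # shape (n, d)

def sgd_step(w, X, y, lr=0.1):
    return w - lr * per_example_grads(w, X, y).mean(axis=0)

def sam_step(w, X, y, lr=0.1, rho=0.05):
    # SAM (Foret et al., 2021): ascend to a nearby worst-case point, then
    # descend using the gradient evaluated there.
    g = per_example_grads(w, X, y).mean(axis=0)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * per_example_grads(w + eps, X, y).mean(axis=0)

def gradient_alignment(w, X, y):
    # Toy alignment statistic: mean pairwise cosine similarity of the
    # per-example gradients (only a stand-in for the paper's coherence
    # measure, which is defined through gradient curvature).
    G = per_example_grads(w, X, y)
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
    S = G @ G.T
    n = S.shape[0]
    return (S.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=32)
w_sgd, w_sam = np.zeros(5), np.zeros(5)
for _ in range(100):
    w_sgd = sgd_step(w_sgd, X, y)
    w_sam = sam_step(w_sam, X, y)
print("alignment after SGD:", gradient_alignment(w_sgd, X, y))
print("alignment after SAM:", gradient_alignment(w_sam, X, y))
```

The statistic above compares raw gradient directions across examples only to illustrate the general idea of cross-example alignment; the paper's coherence measure and stability criterion are defined more carefully.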
Related papers
- Why Some Models Resist Unlearning: A Linear Stability Perspective [7.446140380340418]
We frame unlearning through the lens of linear coherence stability. We decompose coherence along three axes: within the retain set, within the forget set, and between them. To further link data properties to forgettability, we study a two-layer ReLU CNN under a signal-plus-noise model. For empirical geometry, we show that Hessian tests and CNN heatmaps align closely with the predicted boundary, mapping the stability of gradient-based unlearning as a function of verification, mixing, and data/model alignment.
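As a rough illustration of splitting an alignment statistic across retain and forget sets, here is a small sketch; the cosine-similarity definition and the block structure are our own assumptions, not the paper's coherence measure.

```python
import numpy as np

def mean_cosine(A, B):
    """Mean cosine similarity between rows of A and rows of B
    (within-set blocks include the trivial self-similarity terms)."""
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return float((A @ B.T).mean())

def coherence_decomposition(grads_retain, grads_forget):
    """Split a gradient-alignment statistic along three axes: within the
    retain set, within the forget set, and between the two."""
    return {
        "retain-retain": mean_cosine(grads_retain, grads_retain),
        "forget-forget": mean_cosine(grads_forget, grads_forget),
        "retain-forget": mean_cosine(grads_retain, grads_forget),
    }

rng = np.random.default_rng(1)
g_retain = rng.normal(size=(20, 8)) + 1.0   # roughly aligned cluster
g_forget = rng.normal(size=(10, 8)) - 1.0   # pulls in a different direction
print(coherence_decomposition(g_retain, g_forget))
```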
arXiv Detail & Related papers (2026-02-03T01:47:26Z) - Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
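The Signal-to-Noise Ratio criterion can be illustrated with a toy sketch (our own simplification, not the SAGE implementation): score each candidate batch by the magnitude of its mean per-sample gradient relative to the spread around that mean, and keep the highest-SNR batch.

```python
import numpy as np

def update_snr(per_sample_grads):
    """Signal-to-noise ratio of an update: norm of the mean per-sample
    gradient divided by the typical deviation around that mean."""
    g_mean = per_sample_grads.mean(axis=0)
    noise = per_sample_grads - g_mean
    noise_scale = np.sqrt((noise ** 2).sum(axis=1).mean())
    return float(np.linalg.norm(g_mean) / (noise_scale + 1e-12))

def select_batch(candidate_batches):
    """Keep the candidate batch whose update has the highest SNR
    (a toy stand-in for stability-aware candidate selection)."""
    return max(candidate_batches, key=update_snr)

rng = np.random.default_rng(2)
noisy   = rng.normal(size=(64, 16))              # conflicting preference signals
aligned = 0.3 * rng.normal(size=(64, 16)) + 1.0  # consistent preference signals
chosen = select_batch([noisy, aligned])
print("SNR of chosen batch:", update_snr(chosen))
```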
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using a constant learning rate. We provide a simplified analysis recovering the same bias and variance bounds given by Zou et al. (2021), based on simple linear algebra tools. We believe our work makes the analysis of SGD for linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
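A minimal sketch of the setting being analyzed, under our own assumptions about the details (single pass over the data, constant step size, averaging the second half of the iterates):

```python
import numpy as np

def sgd_linreg_tail_average(X, y, lr=0.05):
    """One-pass constant-stepsize SGD for least squares, returning the
    average of the second half of the iterates (weight averaging)."""
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    for i in range(n):
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x_i^T w - y_i)^2
        w = w - lr * grad
        iterates.append(w)
    return np.mean(iterates[n // 2:], axis=0)

rng = np.random.default_rng(3)
w_star = rng.normal(size=10)
X = rng.normal(size=(2000, 10))
y = X @ w_star + 0.1 * rng.normal(size=2000)
w_hat = sgd_linreg_tail_average(X, y)
print("excess risk proxy:", float(np.mean((X @ (w_hat - w_star)) ** 2)))
```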
arXiv Detail & Related papers (2025-06-18T15:10:38Z) - Topological Adaptive Least Mean Squares Algorithms over Simplicial Complexes [13.291627429657416]
This paper introduces a novel adaptive framework for processing dynamic flow signals over simplicial complexes. We present a topological LMS algorithm that efficiently processes streaming signals observed over time-varying edge subsets.
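A heavily simplified sketch of the adaptive-filtering idea, with the simplicial/Hodge structure of the actual algorithm deliberately omitted: a masked LMS update that tracks an edge-flow signal from noisy observations available only on a time-varying subset of edges.

```python
import numpy as np

def masked_lms_step(x_hat, y_obs, mask, mu=0.5):
    """One LMS-style update: correct the estimate only on the edges that are
    observed at this time step (mask == 1), leaving the rest untouched."""
    return x_hat + mu * mask * (y_obs - x_hat)

rng = np.random.default_rng(4)
n_edges = 50
true_flow = rng.normal(size=n_edges)                  # static edge flow to track
x_hat = np.zeros(n_edges)
for t in range(500):
    mask = (rng.random(n_edges) < 0.3).astype(float)  # time-varying edge subset
    y_obs = true_flow + 0.1 * rng.normal(size=n_edges)
    x_hat = masked_lms_step(x_hat, y_obs, mask)
print("relative tracking error:",
      float(np.linalg.norm(x_hat - true_flow) / np.linalg.norm(true_flow)))
```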
arXiv Detail & Related papers (2025-05-29T06:55:19Z) - Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization [34.520966684699665]
Federated Learning (FL) is a distributed learning approach that trains machine learning models across multiple devices. This paper introduces an innovative dynamics analysis framework, namely Libra, for analyzing algorithm generalization performance. We show that larger local steps or momentum accelerate convergence of gradient norms, while worsening model stability.
arXiv Detail & Related papers (2024-11-25T11:43:22Z) - Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm [80.94861441583275]
We investigate the generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm.
Our results analyze the impact of different key factors on the generalization of D-SGDA.
We also balance the optimization error with the generalization bound to obtain the optimal risk in the convex-concave setting.
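For reference, the D-SGDA update can be sketched on a toy strongly-convex-strongly-concave saddle problem (our own example; the local gradients are taken exactly rather than stochastically, and the mixing matrix is fully connected):

```python
import numpy as np

def dsgda_round(X, Y, A, B, mix, lr=0.05):
    """One round of decentralized SGDA on the toy saddle problem held by
    node k, f_k(x, y) = 0.5||x - a_k||^2 + x^T y - 0.5||y - b_k||^2:
    every node takes a descent step in x and an ascent step in y, then
    parameters are averaged with neighbors via the mixing matrix `mix`."""
    grad_X = (X - A) + Y        # row k: gradient of f_k with respect to x
    grad_Y = X - (Y - B)        # row k: gradient of f_k with respect to y
    return mix @ (X - lr * grad_X), mix @ (Y + lr * grad_Y)

rng = np.random.default_rng(5)
K, d = 8, 4                                     # 8 nodes, 4-dimensional x and y
A, B = rng.normal(size=(K, d)), rng.normal(size=(K, d))
mix = np.full((K, K), 1.0 / K)                  # fully connected, doubly stochastic
X, Y = np.zeros((K, d)), np.zeros((K, d))
for _ in range(400):
    X, Y = dsgda_round(X, Y, A, B, mix)
x_star = (A.mean(axis=0) - B.mean(axis=0)) / 2  # saddle point of the average objective
print("distance of node 0 to the saddle point:", float(np.linalg.norm(X[0] - x_star)))
```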
arXiv Detail & Related papers (2023-10-31T11:27:01Z) - Minibatch and Local SGD: Algorithmic Stability and Linear Speedup in Generalization [44.846861387342926]
Minibatch stochastic gradient descent (minibatch SGD) and local SGD are two popular methods for parallel optimization. We study the stability and generalization analysis of minibatch and local SGD. We show minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.
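The two parallel schemes being compared can be sketched on a toy least-squares problem (our own setup): minibatch SGD averages gradients from all workers at every step, while local SGD lets each worker run several steps independently before the models are averaged.

```python
import numpy as np

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

def minibatch_sgd_round(w, shards, lr=0.1):
    """All workers compute a gradient at the shared point; the averaged
    gradient is applied once (communication every step)."""
    return w - lr * np.mean([grad(w, X, y) for X, y in shards], axis=0)

def local_sgd_round(w, shards, lr=0.1, local_steps=5):
    """Each worker runs several SGD steps on its own shard; the resulting
    models are averaged (communication once per round)."""
    local_models = []
    for X, y in shards:
        w_k = w.copy()
        for _ in range(local_steps):
            w_k = w_k - lr * grad(w_k, X, y)
        local_models.append(w_k)
    return np.mean(local_models, axis=0)

rng = np.random.default_rng(6)
w_star = rng.normal(size=5)
shards = []
for _ in range(4):                               # 4 workers, 64 samples each
    X = rng.normal(size=(64, 5))
    shards.append((X, X @ w_star + 0.1 * rng.normal(size=64)))
w_mb, w_local = np.zeros(5), np.zeros(5)
for _ in range(50):
    w_mb = minibatch_sgd_round(w_mb, shards)
    w_local = local_sgd_round(w_local, shards)
print("minibatch SGD error:", float(np.linalg.norm(w_mb - w_star)))
print("local SGD error:", float(np.linalg.norm(w_local - w_star)))
```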
arXiv Detail & Related papers (2023-10-02T12:26:51Z) - Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - The Sobolev Regularization Effect of Stochastic Gradient Descent [8.193914488276468]
We show that flat minima regularize the gradient of the model function, which explains the good performance of flat minima.
We also consider high-order moments of gradient noise, and show, by a linear analysis of SGD around global minima, that Stochastic Gradient Descent (SGD) tends to impose constraints on these moments.
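A small sketch of the quantity being regularized, under a toy model of our own choosing: the data-averaged squared norm of the model's input gradient, a Sobolev-seminorm-style measure (the paper's exact seminorm and its gradient-noise-moment analysis may differ).

```python
import numpy as np

def model(w, x):
    """Toy two-layer network with a fixed second layer: f(x) = sum(tanh(W x))."""
    return np.tanh(w @ x).sum()

def input_gradient_norm(w, X, eps=1e-5):
    """Data-averaged squared norm of df/dx, measuring how strongly the
    learned function varies with its input (finite differences keep the
    sketch dependency-free)."""
    norms = []
    for x in X:
        g = np.zeros_like(x)
        for j in range(len(x)):
            e = np.zeros_like(x)
            e[j] = eps
            g[j] = (model(w, x + e) - model(w, x - e)) / (2 * eps)
        norms.append(g @ g)
    return float(np.mean(norms))

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
w_large = 3.0 * rng.normal(size=(4, 3))   # large weights: a steeper function
w_small = 0.3 * rng.normal(size=(4, 3))   # small weights: a flatter function
print("input-gradient norm, large weights:", input_gradient_norm(w_large, X))
print("input-gradient norm, small weights:", input_gradient_norm(w_small, X))
```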
arXiv Detail & Related papers (2021-05-27T21:49:21Z) - Stability and Generalization of Stochastic Gradient Methods for Minimax Problems [71.60601421935844]
Many machine learning problems can be formulated as minimax problems, such as Generative Adversarial Networks (GANs).
We provide a comprehensive generalization analysis of stochastic gradient methods for training minimax problems.
arXiv Detail & Related papers (2021-05-08T22:38:00Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first stability and generalization bounds for SGD with even non-differentiable loss functions.
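For orientation, the on-average model stability notion is roughly of the following form (our paraphrase of the standard formulation; the paper's exact definition may differ in the norm or in how the expectation is taken): an algorithm A is on-average model epsilon-stable if

```latex
\mathbb{E}_{S,\,S',\,A}\!\left[ \frac{1}{n} \sum_{i=1}^{n}
  \bigl\lVert A(S^{(i)}) - A(S) \bigr\rVert_2 \right] \;\le\; \epsilon ,
```

where S and S' are independent training samples of size n and S^{(i)} denotes S with its i-th example replaced by the i-th example of S'.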
arXiv Detail & Related papers (2020-06-15T06:30:19Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the stochasticity in its success is still unclear.
We show that heavy-tailed behavior commonly arises in the parameters as a consequence of multiplicative noise due to variance in the stochastic updates.
A detailed analysis is conducted in which we describe key factors, including step size and data, and show that state-of-the-art neural network models exhibit similar behavior.
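The basic mechanism can be illustrated in one dimension (a toy of our own, using the standard linear-recursion model for multiplicative noise): when the contraction factor itself is random, the stationary iterates develop far heavier tails than under purely additive noise.

```python
import numpy as np

def run_recursion(multiplicative, n_runs=20000, n_steps=200, seed=8):
    """Iterate w <- a_t * w + b_t. With random a_t (multiplicative noise) the
    stationary distribution can be heavy-tailed even though each factor is
    contractive on average; with a_t fixed the noise is purely additive."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_runs)
    for _ in range(n_steps):
        a = rng.normal(0.9, 0.4, size=n_runs) if multiplicative else 0.9
        b = rng.normal(0.0, 0.1, size=n_runs)
        w = a * w + b
    return w

for mult in (False, True):
    w = run_recursion(mult)
    tail_ratio = np.quantile(np.abs(w), 0.999) / np.quantile(np.abs(w), 0.9)
    label = "multiplicative" if mult else "additive"
    print(label, "tail ratio (99.9% vs 90% quantile):", round(float(tail_ratio), 2))
```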
arXiv Detail & Related papers (2020-06-11T09:58:01Z)