Flat Minima and Generalization: Insights from Stochastic Convex Optimization
- URL: http://arxiv.org/abs/2511.03548v1
- Date: Wed, 05 Nov 2025 15:31:42 GMT
- Title: Flat Minima and Generalization: Insights from Stochastic Convex Optimization
- Authors: Matan Schliserman, Shira Vansover-Hager, Tomer Koren
- Abstract summary: A recently emerging explanation holds that learning algorithms are successful in practice because they converge to flat minima. We show that flat empirical minima may incur trivial $\Omega(1)$ population risk while sharp minima generalize optimally.
- Score: 22.768090791258242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the generalization behavior of learning algorithms is a central goal of learning theory. A recently emerging explanation is that learning algorithms are successful in practice because they converge to flat minima, which have been consistently associated with improved generalization performance. In this work, we study the link between flat minima and generalization in the canonical setting of stochastic convex optimization with a non-negative, $\beta$-smooth objective. Our first finding is that, even in this fundamental and well-studied setting, flat empirical minima may incur trivial $\Omega(1)$ population risk while sharp minima generalize optimally. Then, we show that this poor generalization behavior extends to two natural "sharpness-aware" algorithms originally proposed by Foret et al. (2021), designed to bias optimization toward flat solutions: Sharpness-Aware Gradient Descent (SA-GD) and Sharpness-Aware Minimization (SAM). For SA-GD, which performs gradient steps on the maximal loss in a predefined neighborhood, we prove that while it successfully converges to a flat minimum at a fast rate, the population risk of the solution can still be as large as $\Omega(1)$, indicating that even flat minima found algorithmically using a sharpness-aware gradient method might generalize poorly. For SAM, a computationally efficient approximation of SA-GD based on normalized ascent steps, we show that although it minimizes the empirical loss, it may converge to a sharp minimum and also incur population risk $\Omega(1)$. Finally, we establish population risk upper bounds for both SA-GD and SAM using algorithmic stability techniques.
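To make the two sharpness-aware procedures concrete, below is a minimal sketch of a single SA-GD step and a single SAM step on a generic empirical loss. The specific loss, the radius `rho`, the step size `eta`, and the random-search approximation of SA-GD's inner maximization are illustrative assumptions for this sketch, not the authors' exact construction.

```python
import numpy as np

def sam_step(w, loss_grad, eta=0.1, rho=0.05):
    """One SAM step (Foret et al., 2021): take a normalized ascent step of size rho
    toward the approximate worst-case point, then descend using the gradient there."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent direction
    return w - eta * loss_grad(w + eps)           # descent on the perturbed gradient

def sa_gd_step(w, loss, loss_grad, eta=0.1, rho=0.05, n_samples=64, rng=None):
    """One SA-GD step: a gradient step on max_{||eps|| <= rho} L(w + eps).
    The inner maximum is approximated here by random search over the rho-ball
    (an illustrative stand-in); by Danskin's theorem, the gradient of the max-loss
    is the loss gradient evaluated at the maximizer."""
    rng = np.random.default_rng() if rng is None else rng
    best_eps, best_val = np.zeros_like(w), loss(w)
    for _ in range(n_samples):
        u = rng.standard_normal(w.shape)
        eps = rho * u / np.linalg.norm(u)
        val = loss(w + eps)
        if val > best_val:
            best_eps, best_val = eps, val
    return w - eta * loss_grad(w + best_eps)

# Tiny usage example on a smooth convex quadratic with one flat and one sharp direction.
if __name__ == "__main__":
    A = np.diag([1.0, 10.0])
    loss = lambda w: 0.5 * w @ A @ w
    loss_grad = lambda w: A @ w
    w = np.array([1.0, 1.0])
    for _ in range(50):
        w = sam_step(w, loss_grad)
    print("SAM iterate:", w, "loss:", loss(w))
```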
Related papers
- Zeroth-Order Sharpness-Aware Learning with Exponential Tilting [5.409688800035885]
We explore new zeroth-order algorithms to solve a soft sharpness objective parameterized by a tilting parameter $t$. We provide precise characterizations of the sharpness notions of the tilted SAM framework. Our approach can be used as a gradient-free and memory-efficient alternative to SAM variants.
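For context, exponential tilting of a perturbed loss commonly takes the form below, with the tilting parameter $t$ interpolating between average-case and worst-case sharpness; the perturbation distribution $\mathcal{D}_\rho$ and this exact form are assumptions for illustration, not necessarily the paper's precise objective.

```latex
F_t(w) \;=\; \frac{1}{t}\,\log \mathbb{E}_{\epsilon \sim \mathcal{D}_\rho}\!\left[ e^{\,t\, L(w+\epsilon)} \right],
\qquad
\lim_{t \to 0} F_t(w) = \mathbb{E}_{\epsilon}\!\left[ L(w+\epsilon) \right],
\qquad
\lim_{t \to \infty} F_t(w) = \operatorname*{ess\,sup}_{\epsilon}\, L(w+\epsilon).
```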
arXiv Detail & Related papers (2025-10-17T19:01:34Z) - Zeroth-Order Optimization Finds Flat Minima [51.41529512093436]
We show that zeroth-order optimization with the standard two-point estimator favors solutions with a small trace of the Hessian. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions.
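The "standard two-point estimator" is the usual symmetric zeroth-order gradient estimate; a minimal sketch follows, where the smoothing radius `mu` and the choice of random unit directions are generic conventions rather than the paper's exact setup.

```python
import numpy as np

def two_point_gradient(f, x, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate along a random unit direction u:
    g_hat = d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u.
    In expectation this is the gradient of a smoothed version of f, which is one
    intuition for why such methods drift toward small-Hessian-trace solutions."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)               # uniform direction on the unit sphere
    d = x.size
    return d * (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
```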
arXiv Detail & Related papers (2025-06-05T17:59:09Z) - Riemannian stochastic optimization methods avoid strict saddle points [68.80251170757647]
We show that the policies under study avoid strict saddle points / submanifolds with probability 1.
This result provides an important sanity check as it shows that, almost always, the limit state of an algorithm can only be a local minimizer.
arXiv Detail & Related papers (2023-11-04T11:12:24Z) - How to escape sharp minima with random perturbations [48.095392390925745]
We study the notion of flat minima and the complexity of finding them.
For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently.
For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization.
arXiv Detail & Related papers (2023-05-25T02:12:33Z) - Sharpness-Aware Gradient Matching for Domain Generalization [84.14789746460197]
The goal of domain generalization (DG) is to enhance the generalization capability of the model learned from a source domain to other unseen domains.
The recently developed Sharpness-Aware Minimization (SAM) method aims to achieve this goal by minimizing the sharpness measure of the loss landscape.
We present two conditions to ensure that the model can converge to a flat minimum with a small loss, and propose an algorithm named Sharpness-Aware Gradient Matching (SAGM).
Our proposed SAGM method consistently outperforms the state-of-the-art methods on five DG benchmarks.
arXiv Detail & Related papers (2023-03-18T07:25:12Z) - Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves
Generalization [33.50116027503244]
We show that the zeroth-order flatness can be insufficient to discriminate minima with low gradient error.
We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions.
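For reference, the zeroth- and first-order flatness notions contrasted here are commonly defined over a $\rho$-ball as below; the exact normalization used in the paper may differ, so treat this as an assumed illustration. GAM penalizes the first-order quantity during training, since a small zeroth-order flatness alone need not control curvature.

```latex
R^{(0)}_{\rho}(w) \;=\; \max_{\|\epsilon\|\le\rho} L(w+\epsilon) \;-\; L(w),
\qquad
R^{(1)}_{\rho}(w) \;=\; \rho \,\max_{\|\epsilon\|\le\rho} \bigl\| \nabla L(w+\epsilon) \bigr\|.
```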
arXiv Detail & Related papers (2023-03-03T16:58:53Z) - GA-SAM: Gradient-Strength based Adaptive Sharpness-Aware Minimization
for Improved Generalization [22.53923556656022]
Sharpness-Aware Minimization (SAM) algorithm has shown state-of-the-art generalization abilities in vision tasks.
However, there is some difficulty in applying SAM to some natural language tasks, especially to models with drastic changes, such as RNNs.
We propose a Gradient-Strength based Adaptive Sharpness-Aware Minimization (GA-SAM) algorithm to help learning algorithms find flat minima that generalize better.
arXiv Detail & Related papers (2022-10-13T10:44:10Z) - Surrogate Gap Minimization Improves Sharpness-Aware Training [52.58252223573646]
Surrogate Gap Guided Sharpness-Aware Minimization (GSAM) is a novel improvement over Sharpness-Aware Minimization (SAM) with negligible computation overhead.
GSAM seeks a region with both small loss (by step 1) and low sharpness (by step 2), giving rise to a model with high generalization capabilities.
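Reading the two-step description above in the standard SAM notation, the perturbed loss and the surrogate gap are typically as below (an assumed illustration, not quoted from the paper); step 1 targets a small perturbed loss and step 2 targets a small gap, i.e., low sharpness around the solution.

```latex
L_{\rho}(w) \;=\; \max_{\|\delta\|\le\rho} L(w+\delta),
\qquad
h(w) \;=\; L_{\rho}(w) \;-\; L(w).
```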
arXiv Detail & Related papers (2022-03-15T16:57:59Z) - Train simultaneously, generalize better: Stability of gradient-based
minimax learners [12.691047660244331]
We show that algorithmic stability plays a key role in the performance of the trained minimax model under both convex-concave and nonconvex-nonconcave minimax settings.
We discuss several numerical results indicating the role of optimization algorithms in the generalization of learned minimax models.
arXiv Detail & Related papers (2020-10-23T17:44:43Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
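For scale, the minimax-optimal sample complexity in this generative-model setting is, up to logarithmic factors, the standard rate below; this is stated from general knowledge of the discounted-MDP literature and is only assumed to match the paper's bound.

```latex
\widetilde{\Theta}\!\left( \frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^{3}\,\varepsilon^{2}} \right)
\quad \text{samples to find an } \varepsilon\text{-optimal policy.}
```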
arXiv Detail & Related papers (2020-05-26T17:53:18Z)