The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines
and Drifting Towards Wide Minima
- URL: http://arxiv.org/abs/2210.01513v2
- Date: Tue, 11 Apr 2023 17:50:58 GMT
- Title: The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines
and Drifting Towards Wide Minima
- Authors: Peter L. Bartlett, Philip M. Long and Olivier Bousquet
- Abstract summary: We consider Sharpness-Aware Minimization, a gradient-based optimization method for deep networks.
We show that when SAM is applied with a convex quadratic objective, it converges to a cycle that oscillates from one side of the minimum to the other along the direction of largest curvature.
In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian.
- Score: 41.961056785108845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization
method for deep networks that has exhibited performance improvements on image
and language prediction problems. We show that when SAM is applied with a
convex quadratic objective, for most random initializations it converges to a
cycle that oscillates between either side of the minimum in the direction with
the largest curvature, and we provide bounds on the rate of convergence.
In the non-quadratic case, we show that such oscillations effectively perform
gradient descent, with a smaller step-size, on the spectral norm of the
Hessian. In such cases, SAM's update may be regarded as a third derivative --
the derivative of the Hessian in the leading eigenvector direction -- that
encourages drift toward wider minima.
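To make the quadratic-case claim concrete, here is a minimal numerical sketch. It assumes the standard SAM update with a normalized ascent step, w <- w - eta * grad L(w + rho * grad L(w) / ||grad L(w)||), applied to a diagonal quadratic; the step size eta, radius rho, and curvatures below are illustrative choices, not values from the paper.

```python
import numpy as np

# Convex quadratic L(w) = 0.5 * w^T H w; the sharpest direction is the
# first coordinate (curvature 4.0), the flattest is the last (0.25).
H = np.diag([4.0, 1.0, 0.25])

def grad(w):
    return H @ w

def sam_step(w, eta=0.1, rho=0.3):
    """One SAM step: evaluate the gradient at the perturbed point
    w + rho * g / ||g||, then descend from w using that gradient.
    (eta and rho are illustrative values, not taken from the paper.)"""
    g = grad(w)
    g_adv = grad(w + rho * g / (np.linalg.norm(g) + 1e-12))
    return w - eta * g_adv

rng = np.random.default_rng(0)
w = rng.normal(size=3)
for t in range(500):
    w = sam_step(w)
    if t >= 496:
        print(t, np.round(w, 4))
# The late iterates flip sign along the first (sharpest) coordinate each
# step while the other coordinates decay toward zero: a two-point cycle
# straddling the minimum in the direction of largest curvature. For this
# toy configuration the amplitude settles near
# eta * rho * lambda_max / (2 - eta * lambda_max) = 0.075.
```

In the non-quadratic setting the Hessian changes along the cycle, which is what produces the drift toward wider minima described in the abstract.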
Related papers
- Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization [7.032245866317619]
We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task.
We prove several results that relate its action on the underlying landscape and training dynamics to the sharpness of the loss.
arXiv Detail & Related papers (2025-03-14T21:45:12Z)
- Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity [59.75300530380427]
We consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm is only accessible to noisy evaluations of the objective function it queries.
We provide the first tight characterization for the rate of the minimax simple regret by developing matching upper and lower bounds.
arXiv Detail & Related papers (2024-06-28T02:56:22Z)
- Sharp detection of low-dimensional structure in probability measures via dimensional logarithmic Sobolev inequalities [0.5592394503914488]
We introduce a method for identifying and approximating a target measure $\pi$ as a perturbation of a given reference measure $\mu$.
Our main contribution unveils a connection between the dimensional logarithmic Sobolev inequality (LSI) and approximations with this ansatz.
arXiv Detail & Related papers (2024-06-18T20:02:44Z)
- Directional Smoothness and Gradient Methods: Convergence and Adaptivity [16.779513676120096]
We develop new sub-optimality bounds for gradient descent that depend on the conditioning of the objective along the path of optimization.
Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective.
We prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness.
arXiv Detail & Related papers (2024-03-06T22:24:05Z)
- Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [54.20763128054692]
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression.
We prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics.
arXiv Detail & Related papers (2024-02-29T18:43:52Z)
- Implicit regularization in AI meets generalized hardness of approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation (GHA).
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z)
- Implicit Sparse Regularization: The Impact of Depth and Early Stopping [35.4113861165802]
We show that early stopping is crucial for gradient descent to converge to a sparse model.
We characterize the impact of depth and early stopping and show that for a general depth parameter N, gradient descent with early stopping achieves minimax optimal sparse recovery.
arXiv Detail & Related papers (2021-08-12T07:43:29Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and that this rate is provably accelerated by incorporating a scheduled restarting procedure (see the sketch after this list).
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Potential Function-based Framework for Making the Gradients Small in Convex and Min-Max Optimization [14.848525762485872]
Making the gradients small is a fundamental optimization problem that has eluded unifying and simple convergence arguments.
We introduce a novel potential function-based framework to study the convergence of standard methods for making the gradients small.
arXiv Detail & Related papers (2021-01-28T16:41:00Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step toward constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
- Effective Dimension Adaptive Sketching Methods for Faster Regularized Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching.
We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT).
arXiv Detail & Related papers (2020-06-10T15:00:09Z)
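As a second illustration, the stochastic extragradient entry above names a concrete scheme: constant step size, iteration averaging, and scheduled restarts. Below is a minimal sketch of that scheme on a toy bilinear game; the payoff matrix, step size, noise level, window length, and restart count are arbitrary illustrative choices, not taken from that paper.

```python
import numpy as np

# Two-player bilinear game min_x max_y x^T A y; the unique Nash
# equilibrium is (x, y) = (0, 0).
A = np.diag([1.0, 2.0, 3.0])
d = A.shape[0]
rng = np.random.default_rng(1)

def field(z, noise):
    """Noisy gradient field (grad_x f, -grad_y f) of the bilinear game."""
    x, y = z[:d], z[d:]
    g = np.concatenate([A @ y, -A.T @ x])
    return g + noise * rng.normal(size=2 * d)

def seg_restart_avg(z0, eta=0.1, noise=0.05, window=500, restarts=4):
    z = z0.copy()
    for _ in range(restarts):
        running_sum = np.zeros_like(z)
        for _ in range(window):
            z_lead = z - eta * field(z, noise)   # extrapolation (lookahead) step
            z = z - eta * field(z_lead, noise)   # update using the lookahead gradient
            running_sum += z
        z = running_sum / window                 # restart from the window's average iterate
    return z

z0 = rng.normal(size=2 * d)
print("||z0||      =", np.linalg.norm(z0))
print("||z_final|| =", np.linalg.norm(seg_restart_avg(z0)))
```

The averaging step damps the noise that keeps the raw iterates hovering around the equilibrium, and restarting each window from the averaged point follows the spirit of the scheduled restarting procedure mentioned in the entry.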
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.