How to escape sharp minima with random perturbations
- URL: http://arxiv.org/abs/2305.15659v3
- Date: Sat, 25 May 2024 14:42:20 GMT
- Title: How to escape sharp minima with random perturbations
- Authors: Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra
- Abstract summary: We study the notion of flat minima and the complexity of finding them.
For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently.
For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization.
- Score: 48.095392390925745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.
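To make the perturbation idea concrete, here is a minimal numerical sketch (an illustration under simplifying assumptions, not the paper's actual algorithm or its guarantees): for Gaussian perturbations u and a small radius sigma, comparing gradients taken at randomly perturbed iterates with the gradient at the current iterate isolates a term proportional to the gradient of the Hessian trace, i.e., a direction along which flatness changes. The toy cost, antithetic sampling, sample count, and scaling below are illustrative choices.
```python
import numpy as np

def grad(w):
    # Toy cost f(w) = sum(w_i^4): its Hessian trace is 12 * sum(w_i^2),
    # which shrinks toward the origin, so flatter points lie closer to 0.
    return 4.0 * w ** 3

def flatness_direction(w, sigma=1e-2, n_samples=2000, seed=0):
    """Estimate the gradient of tr(Hessian) from gradients at perturbed iterates.

    For u ~ N(0, I), a Taylor expansion gives
        (grad f(w + sigma*u) + grad f(w - sigma*u)) / 2
            ~ grad f(w) + (sigma^2 / 2) * D^3 f(w)[u, u],
    and E_u[D^3 f(w)[u, u]] = grad tr(H(w)).  Averaging over samples and
    rescaling therefore estimates the direction of increasing Hessian trace;
    stepping against it moves toward flatter regions.
    """
    rng = np.random.default_rng(seed)
    g0 = grad(w)
    acc = np.zeros_like(w)
    for _ in range(n_samples):
        u = rng.standard_normal(w.shape)
        acc += 0.5 * (grad(w + sigma * u) + grad(w - sigma * u)) - g0
    return (acc / n_samples) * (2.0 / sigma ** 2)  # ~ grad tr(H(w))

w = np.array([1.0, -0.5])
print(flatness_direction(w))  # close to the analytic value 24 * w = [24., -12.]
```
A full method would interleave steps against this estimated direction with ordinary descent steps; the paper's precise schedule, perturbation scheme, and analysis differ from this sketch.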
Related papers
- Deep Point-to-Plane Registration by Efficient Backpropagation for Error Minimizing Function [0.0]
Traditional point set registration algorithms that minimize point-to-plane distances often achieve a better estimation of the rigid transformation than those minimizing point-to-point distances.
Recent deep-learning-based methods minimize the point-to-point distances.
This paper proposes the first deep-learning-based approach to point-to-plane registration.
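For context on the contrast drawn in this entry, below is a minimal sketch of the two residuals being compared (point-to-point vs. point-to-plane); the synthetic correspondences and normals are placeholders, not the paper's registration pipeline.
```python
import numpy as np

def point_to_point_residuals(src, dst):
    """Squared Euclidean distances between corresponding points."""
    return np.sum((src - dst) ** 2, axis=1)

def point_to_plane_residuals(src, dst, dst_normals):
    """Squared distances from source points to the tangent planes at their matches."""
    return np.sum((src - dst) * dst_normals, axis=1) ** 2

rng = np.random.default_rng(0)
dst = rng.standard_normal((100, 3))                # "target" points
normals = rng.standard_normal((100, 3))            # synthetic surface normals
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
src = dst + 0.01 * rng.standard_normal((100, 3))   # slightly misaligned copy

print(point_to_point_residuals(src, dst).mean())
print(point_to_plane_residuals(src, dst, normals).mean())
```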
arXiv Detail & Related papers (2022-07-14T05:18:20Z)
- Questions for Flat-Minima Optimization of Modern Neural Networks [28.12506392321345]
Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA) and 2. Minimax methods (i.e., Sharpness-Aware Minimization, SAM).
We investigate the loss surfaces of these approaches through a systematic benchmarking across computer vision, natural language processing, and graph learning tasks.
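As a rough illustration of the two families named above, here is a minimal sketch of a SAM-style update (ascend to a nearby worst-case point, then descend using the gradient there) next to an SWA-style average of late iterates; the toy gradient, learning rate, perturbation radius, and averaging window are illustrative assumptions, not the benchmark's settings.
```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style step: ascend to a nearby worst-case point, then descend
    using the gradient evaluated there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # adversarial perturbation
    return w - lr * grad_fn(w + eps)

def swa_average(iterates):
    """SWA-style solution: a plain average of iterates from the tail of training."""
    return np.mean(iterates, axis=0)

# Toy objective f(w) = sum(w_i^4), used only to exercise the two update rules.
grad_fn = lambda w: 4.0 * w ** 3
w = np.array([1.0, -0.5])
history = []
for _ in range(200):
    w = sam_step(w, grad_fn)
    history.append(w.copy())
print(w)                           # final SAM iterate
print(swa_average(history[-50:]))  # averaged (SWA-style) solution
```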
arXiv Detail & Related papers (2022-02-01T18:56:15Z)
- Unveiling the structure of wide flat minima in neural networks [0.46664938579243564]
The success of deep learning has revealed the application potential of neural networks across the sciences.
arXiv Detail & Related papers (2021-07-02T16:04:57Z)
- Asymptotic study of stochastic adaptive algorithm in non-convex landscape [2.1320960069210484]
This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning.
Among them are Adagrad and RMSProp, which are involved in most of the blackbox deep learning algorithms.
arXiv Detail & Related papers (2020-12-10T12:54:45Z)
- Activation Relaxation: A Local Dynamical Approximation to Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
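Below is a minimal sketch of the equilibrium idea for a plain linear chain of layers, where locally relaxed error units settle at a fixed point equal to the backpropagated errors; the network, step size, and iteration count are illustrative, and the paper's actual AR dynamics and neuron model may differ.
```python
import numpy as np

# A toy chain of linear layers y_{l+1} = W_l @ y_l with a squared loss at the output.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)) * 0.3 for _ in range(3)]
ys = [rng.standard_normal(4)]
for W in Ws:
    ys.append(W @ ys[-1])
target = rng.standard_normal(4)

# Error units e_l, with the output error clamped to dL/dy_L = y_L - target.
es = [np.zeros(4) for _ in Ws] + [ys[-1] - target]

# Relax each e_l with de_l/dt = -e_l + W_l^T e_{l+1}; the unique fixed point
# is exactly the backpropagated error for this linear chain.
dt = 0.2
for _ in range(200):
    for l in reversed(range(len(Ws))):
        es[l] += dt * (-es[l] + Ws[l].T @ es[l + 1])

# Compare against standard backpropagation.
delta = ys[-1] - target
for l in reversed(range(len(Ws))):
    delta = Ws[l].T @ delta
    print(np.allclose(es[l], delta))  # True at every layer
```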
arXiv Detail & Related papers (2020-09-11T11:56:34Z)
- Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence [120.9336529957224]
In this paper, we study gradient-free minimax optimization in the nonconvex-strongly-concave setting.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z)
- Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario by extensive numerical validations.
An easy to compute flatness measure shows a clear correlation with test accuracy.
arXiv Detail & Related papers (2020-06-14T13:22:19Z)
- Exploiting Higher Order Smoothness in Derivative-free Optimization and Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
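As a rough illustration, here is a minimal sketch of one randomized zero-order projected gradient step built on a standard two-point estimator; the paper's estimator additionally exploits higher-order smoothness (which this sketch omits), and the toy objective, ball radius, and step sizes are illustrative assumptions.
```python
import numpy as np

def zo_projected_step(w, f, radius=2.0, lr=0.1, h=1e-3, rng=None):
    """One projected-gradient step driven only by function values.

    A two-point estimate along a random unit direction u,
        g = d * (f(w + h*u) - f(w - h*u)) / (2*h) * u,
    serves as a surrogate for the gradient; the update is then projected
    back onto a Euclidean ball of the given radius.
    """
    rng = np.random.default_rng(rng)
    d = w.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    g = d * (f(w + h * u) - f(w - h * u)) / (2.0 * h) * u
    w_new = w - lr * g
    norm = np.linalg.norm(w_new)
    return w_new if norm <= radius else w_new * (radius / norm)

f = lambda w: np.sum((w - 0.3) ** 2)  # strongly convex toy objective
w = np.zeros(5)
for t in range(2000):
    w = zo_projected_step(w, f, lr=1.0 / (t + 10), rng=t)
print(w)  # approaches the minimizer 0.3 * ones
```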
arXiv Detail & Related papers (2020-06-14T10:42:23Z)
- Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to efficiently escape saddle points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.