A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima
- URL: http://arxiv.org/abs/2002.03495v14
- Date: Mon, 22 Jun 2020 03:52:54 GMT
- Title: A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima
- Authors: Zeke Xie, Issei Sato, and Masashi Sugiyama
- Abstract summary: We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
- Score: 91.11332770406007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Gradient Descent (SGD) and its variants are mainstream methods for
training deep networks in practice. SGD is known to find a flat minimum that
often generalizes well. However, it is mathematically unclear how deep learning
can select a flat minimum among so many minima. To answer the question
quantitatively, we develop a density diffusion theory (DDT) to reveal how
minima selection depends on the minima sharpness and the
hyperparameters. To the best of our knowledge, we are the first to
theoretically and empirically prove that, benefiting from the Hessian-dependent
covariance of stochastic gradient noise, SGD favors flat minima exponentially
more than sharp minima, while Gradient Descent (GD) with injected white noise
favors flat minima only polynomially more than sharp minima. We also reveal
that either a small learning rate or large-batch training requires
exponentially many iterations to escape from minima, with the exponent
governed by the ratio of the batch size to the learning rate. Thus, large-batch
training cannot search for flat minima efficiently in a realistic
computational time.
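The abstract's quantitative claim can be illustrated with a toy computation. Under the Hessian-dependent noise covariance described in the abstract, a Kramers-type argument gives a mean escape time that grows roughly like exp(c * B * dL / (eta * H)), where dL is the barrier height, H the curvature at the minimum, B the batch size, and eta the learning rate; the exponent is large for flat minima (small H) and for a large ratio B/eta. The one-dimensional simulation below is a minimal sketch of this effect, not the paper's experiment: the quadratic wells, the identification of the noise covariance with H, and all constants are illustrative assumptions.

```python
# Minimal 1-D sketch (not the paper's experiment) of the escape-time claim.
# Following the abstract, the minibatch gradient-noise covariance is modeled as
# proportional to the local curvature H, so each SGD step reads
#   theta <- theta - eta * (grad + xi),   xi ~ N(0, H / B).
# A Kramers-type argument then gives a mean escape time roughly proportional to
# exp(c * B * dL / (eta * H)) for barrier height dL; the wells, constants, and
# the choice C(theta) = H below are illustrative assumptions.
import numpy as np

def mean_escape_iters(h, dL, eta=0.05, B=1, n_runs=200, max_iters=100_000, seed=0):
    """Mean number of SGD-like steps to cross the barrier at theta = 0, starting
    from the bottom of a quadratic well of curvature h whose loss rises by dL
    at the barrier (well centre at -sqrt(2*dL/h)). Runs that never escape are
    counted as max_iters, so the returned value is a lower bound."""
    rng = np.random.default_rng(seed)
    centre = -np.sqrt(2.0 * dL / h)
    theta = np.full(n_runs, centre)
    exit_t = np.full(n_runs, max_iters)
    active = np.ones(n_runs, dtype=bool)
    for t in range(1, max_iters + 1):
        if not active.any():
            break
        grad = h * (theta[active] - centre)                    # quadratic well
        noise = rng.normal(0.0, np.sqrt(h / B), active.sum())  # curvature-scaled noise
        theta[active] = theta[active] - eta * (grad + noise)
        escaped = active & (theta > 0.0)
        exit_t[escaped] = t
        active &= ~escaped
    return exit_t.mean()

dL = 0.5  # same barrier height for both wells, so only the sharpness differs
print("sharp well (H=10): ", mean_escape_iters(h=10.0, dL=dL))
print("flat  well (H=2.5):", mean_escape_iters(h=2.5, dL=dL))
# Increasing B or decreasing eta stretches both escape times exponentially,
# matching the abstract's point about large-batch / small-learning-rate training.
```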
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable through the training procedure, including gradient-based optimizers and regularizers, which limits this flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - The Effect of SGD Batch Size on Autoencoder Learning: Sparsity,
Sharpness, and Feature Learning [14.004531386769328]
We investigate the dynamics of stochastic gradient descent (SGD) for a single-neuron autoencoder.
For any batch size strictly smaller than the number of samples, SGD finds a global minimum that is sparse and nearly orthogonal to its initialization, an effect induced by the randomness of the stochastic gradients.
arXiv Detail & Related papers (2023-08-06T21:54:07Z) - How to escape sharp minima with random perturbations [48.095392390925745]
We study the notion of flat minima and the complexity of finding them.
For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently.
For the setting where the cost function is an empirical risk over training data, we present a faster algorithm inspired by the recently proposed sharpness-aware minimization (SAM) method; a minimal SAM-style update is sketched after this list.
arXiv Detail & Related papers (2023-05-25T02:12:33Z) - Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves
Generalization [33.50116027503244]
We show that zeroth-order flatness can be insufficient to discriminate minima with low generalization error.
We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions.
arXiv Detail & Related papers (2023-03-03T16:58:53Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - A variance principle explains why dropout finds flatter minima [0.0]
We show that training with dropout finds a neural network with a flatter minimum than standard gradient descent training.
We propose a Variance Principle: the variance of the noise is larger along the sharper directions of the loss landscape.
arXiv Detail & Related papers (2021-11-01T15:26:19Z) - Unveiling the structure of wide flat minima in neural networks [0.46664938579243564]
The success of deep learning has revealed the application potential of networks across the sciences.
arXiv Detail & Related papers (2021-07-02T16:04:57Z) - Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM
in Deep Learning [165.47118387176607]
It is not yet clear why Adam-like adaptive gradient algorithms suffer from worse generalization than SGD despite their faster training speed.
Specifically, we observe heavy tails in the gradient noise of these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z) - Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the local region around a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
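As background for the SAM-inspired algorithm mentioned in the "How to escape sharp minima with random perturbations" entry above, the sketch below shows the standard sharpness-aware minimization (SAM) update: the gradient is evaluated at an adversarially perturbed point within a small L2 ball around the current weights and then applied to the original weights. This is a minimal illustration on a toy quadratic, not the escape algorithm from that paper; the loss and the lr/rho values are assumptions.

```python
# Background sketch of the standard SAM update (Foret et al.); this is NOT the
# escape algorithm of the related paper above. The toy quadratic loss and the
# lr/rho values are illustrative assumptions.
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM step: take the gradient at the (first-order) worst-case point
    within an L2 ball of radius rho around w, then apply it to the original weights."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # first-order ascent direction
    return w - lr * grad_fn(w + eps)             # descend with the perturbed gradient

# Toy usage: quadratic loss L(w) = 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([10.0, 1.0])
grad_fn = lambda w: A @ w
w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, grad_fn)
print("final w:", w)  # ends up in a small neighbourhood of the minimum at the origin
```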