A variance principle explains why dropout finds flatter minima
- URL: http://arxiv.org/abs/2111.01022v1
- Date: Mon, 1 Nov 2021 15:26:19 GMT
- Title: A variance principle explains why dropout finds flatter minima
- Authors: Zhongwang Zhang, Hanxu Zhou, Zhi-Qin John Xu
- Abstract summary: We show that training with dropout finds a neural network at a flatter minimum than standard gradient descent training.
We propose a Variance Principle: the variance of the noise is larger along the sharper directions of the loss landscape.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although dropout has achieved great success in deep learning, little is known
about how it helps the training find a good generalization solution in the
high-dimensional parameter space. In this work, we show that training with
dropout finds a neural network at a flatter minimum than standard gradient
descent training does. We further study, through experiments, the underlying
mechanism of why dropout finds flatter minima. We propose a Variance Principle:
the variance of the noise is larger along the sharper directions of the loss
landscape. Existing works show that SGD satisfies the variance principle, which
leads the training to flatter minima. Our work shows that the noise induced by
dropout also satisfies the variance principle, which explains why dropout finds
flatter minima. In general, our work points out that the variance principle is
an important similarity between dropout and SGD that leads the training to
flatter minima and good generalization.
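As an illustration of what the Variance Principle asserts, the following minimal sketch (not the authors' code; the toy data, network, dropout rate, and iteration counts are illustrative assumptions) estimates the variance of the dropout-induced gradient noise along the sharpest direction of the loss landscape, found by power iteration on Hessian-vector products, and compares it with the variance along a random, mostly flat direction.
```python
# Sketch: probe the Variance Principle on a toy problem (illustrative assumptions).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression problem and a small MLP with dropout.
x = torch.randn(256, 10)
y = torch.sin(x.sum(dim=1, keepdim=True))
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
params = [p for p in model.parameters() if p.requires_grad]

def flat_grad(loss, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

# Sharp direction: top Hessian eigenvector of the dropout-free loss,
# approximated by power iteration on Hessian-vector products.
model.eval()                       # dropout off while probing the landscape
g = flat_grad(loss_fn(model(x), y), create_graph=True)
v = torch.randn_like(g)
v /= v.norm()
for _ in range(50):
    hv = torch.autograd.grad(g @ v, params, retain_graph=True)
    hv = torch.cat([h.reshape(-1) for h in hv])
    v = hv / (hv.norm() + 1e-12)
sharp_dir = v

# A random direction (roughly orthogonal to the sharp one) as a "flat" proxy:
# in high dimensions a random direction mostly avoids the few sharp directions.
rand_dir = torch.randn_like(sharp_dir)
rand_dir -= (rand_dir @ sharp_dir) * sharp_dir
rand_dir /= rand_dir.norm()

# Dropout noise: resample dropout masks at the same parameter point and measure
# the variance of the gradient's projection onto each direction.
model.train()                      # dropout on
proj_sharp, proj_rand = [], []
for _ in range(200):
    g_drop = flat_grad(loss_fn(model(x), y))
    proj_sharp.append((g_drop @ sharp_dir).item())
    proj_rand.append((g_drop @ rand_dir).item())

var = lambda xs: torch.tensor(xs).var().item()
print(f"dropout-noise variance, sharp direction:  {var(proj_sharp):.3e}")
print(f"dropout-noise variance, random direction: {var(proj_rand):.3e}")
```
Under the Variance Principle, the first variance is expected to come out markedly larger than the second, mirroring the behavior that existing works report for SGD noise.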
Related papers
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees uniformly good performance across downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
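A minimal sketch of the early-dropout schedule described above, assuming a generic PyTorch model, data loader, optimizer, and loss, with an illustrative `dropout_epochs` threshold that is not taken from the paper:
```python
import torch.nn as nn

def set_dropout(model: nn.Module, enabled: bool) -> None:
    """Switch every Dropout submodule on or off, leaving other layers untouched."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train(enabled)

def train_with_early_dropout(model, loader, optimizer, loss_fn,
                             num_epochs=100, dropout_epochs=10):
    for epoch in range(num_epochs):
        model.train()
        # Early dropout: keep dropout active only during the initial phase,
        # then turn it off for the remainder of training.
        set_dropout(model, enabled=(epoch < dropout_epochs))
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()
```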
arXiv Detail & Related papers (2023-03-02T18:59:15Z)
- Implicit regularization of dropout [3.42658286826597]
It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training.
In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments.
We experimentally find that training with dropout leads to a neural network at a flatter minimum than standard gradient descent training.
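One way such a flatness comparison could be run in practice is sketched below; the perturbation-based sharpness proxy, its scale, and the sample count are illustrative assumptions rather than the paper's procedure.
```python
import copy
import torch

@torch.no_grad()
def sharpness_proxy(model, loss_fn, x, y, radius=1e-2, n_samples=20):
    """Average increase in loss under small random, weight-scaled perturbations
    of the parameters; a larger value indicates a sharper minimum."""
    model.eval()  # keep dropout inactive while measuring
    base_loss = loss_fn(model(x), y).item()
    increases = []
    for _ in range(n_samples):
        perturbed = copy.deepcopy(model)
        for p in perturbed.parameters():
            # Per-entry noise scaled so the perturbation norm is about radius * ||p||.
            p.add_(radius * p.norm() / p.numel() ** 0.5 * torch.randn_like(p))
        increases.append(loss_fn(perturbed(x), y).item() - base_loss)
    return sum(increases) / len(increases)

# Hypothetical usage with models trained with and without dropout:
# sharpness_proxy(model_dropout, loss_fn, x_val, y_val)
# sharpness_proxy(model_plain,   loss_fn, x_val, y_val)
```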
arXiv Detail & Related papers (2022-07-13T04:09:14Z)
- Combining resampling and reweighting for faithful stochastic optimization [1.52292571922932]
When the loss function is a sum of multiple terms, a popular method is gradient descent.
We show that differences in the Lipschitz constants of the terms in the loss function cause gradient descent to have different variances at different minima.
arXiv Detail & Related papers (2021-05-31T04:21:25Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- On regularization of gradient descent, layer imbalance and flat minima [9.08659783613403]
We analyze the training dynamics for deep linear networks using a new metric - imbalance - which defines the flatness of a solution.
We demonstrate that different regularization methods, such as weight decay or noise data augmentation, behave in a similar way.
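For deep linear networks, one balancedness quantity that is standard in this literature is W_{l+1}^T W_{l+1} - W_l W_l^T for consecutive layers; the sketch below uses its Frobenius norm as an imbalance proxy (the paper's exact definition of its imbalance metric may differ, so treat this as illustrative).
```python
import torch.nn as nn

def layer_imbalance(linear_net: nn.Sequential):
    """Frobenius norm of W_{l+1}^T W_{l+1} - W_l W_l^T for each pair of
    consecutive Linear layers; all zeros means a perfectly balanced network."""
    weights = [m.weight for m in linear_net if isinstance(m, nn.Linear)]
    return [(w_next.T @ w_next - w_l @ w_l.T).norm().item()
            for w_l, w_next in zip(weights[:-1], weights[1:])]

# Example: a 3-layer deep linear network (no nonlinearities, no biases).
net = nn.Sequential(nn.Linear(10, 20, bias=False),
                    nn.Linear(20, 20, bias=False),
                    nn.Linear(20, 5, bias=False))
print(layer_imbalance(net))
```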
arXiv Detail & Related papers (2020-07-18T00:09:14Z)
- Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks.
We show that the covariance of the SGD noise in the local region around a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
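A minimal sketch of how the state dependence of the SGD noise covariance could be probed, using an illustrative linear model and random data rather than anything from the paper: fix the parameters, resample mini-batches, and estimate the trace of the covariance of the resulting stochastic gradients.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1024, 10)          # illustrative data
y = torch.randn(1024, 1)
model = nn.Linear(10, 1)           # illustrative model
loss_fn = nn.MSELoss()

def flat_grad(loss):
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def sgd_noise_trace(batch_size=32, n_samples=500):
    """Trace of the empirical covariance of mini-batch gradients at the current state."""
    samples = []
    for _ in range(n_samples):
        idx = torch.randint(0, x.shape[0], (batch_size,))
        samples.append(flat_grad(loss_fn(model(x[idx]), y[idx])))
    g = torch.stack(samples)                     # (n_samples, n_params)
    noise = g - g.mean(dim=0, keepdim=True)
    return (noise ** 2).sum(dim=1).mean().item()

# Evaluating this at different parameter states (e.g. before and after some
# optimizer steps) shows how the noise covariance varies with the state.
print(sgd_noise_trace())
```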
arXiv Detail & Related papers (2020-06-24T13:34:38Z) - A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.