The Implicit and Explicit Regularization Effects of Dropout
- URL: http://arxiv.org/abs/2002.12915v3
- Date: Thu, 15 Oct 2020 07:44:22 GMT
- Title: The Implicit and Explicit Regularization Effects of Dropout
- Authors: Colin Wei, Sham Kakade, Tengyu Ma
- Abstract summary: Dropout is a widely-used regularization technique, often required to obtain state-of-the-art for a number of architectures.
This work demonstrates that dropout introduces two distinct but entangled regularization effects.
- Score: 43.431343291010734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dropout is a widely-used regularization technique, often required to obtain
state-of-the-art for a number of architectures. This work demonstrates that
dropout introduces two distinct but entangled regularization effects: an
explicit effect (also studied in prior work) which occurs since dropout
modifies the expected training objective, and, perhaps surprisingly, an
additional implicit effect from the stochasticity in the dropout training
update. This implicit regularization effect is analogous to the effect of
stochasticity in small mini-batch stochastic gradient descent. We disentangle
these two effects through controlled experiments. We then derive analytic
simplifications which characterize each effect in terms of the derivatives of
the model and the loss, for deep neural networks. We demonstrate these
simplified, analytic regularizers accurately capture the important aspects of
dropout, showing they faithfully replace dropout in practice.
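As an illustration of the explicit effect only, here is a minimal sketch. It assumes a far simpler setting than the paper's deep networks: linear regression with inverted dropout applied to the inputs. In that setting the expected dropout loss equals the plain squared loss plus a data-dependent penalty. The function names and the toy values below are hypothetical and are not taken from the paper.

```python
from itertools import product


def squared_loss(w, x, y):
    """Squared error of a linear model w . x against target y."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (y - pred) ** 2


def expected_dropout_loss(w, x, y, keep_prob):
    """Exact expectation of the loss over every Bernoulli keep/drop mask."""
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):
        # Probability of drawing this particular mask.
        prob = 1.0
        for m in mask:
            prob *= keep_prob if m == 1 else (1.0 - keep_prob)
        # Inverted dropout: kept inputs are rescaled by 1 / keep_prob.
        x_dropped = [m * xi / keep_prob for m, xi in zip(mask, x)]
        total += prob * squared_loss(w, x_dropped, y)
    return total


def explicit_regularizer(w, x, keep_prob):
    """Variance of the dropped-out prediction (linear-model assumption)."""
    scale = (1.0 - keep_prob) / keep_prob
    return scale * sum((wi * xi) ** 2 for wi, xi in zip(w, x))


# Toy values chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, -0.5]
y = 0.3
keep_prob = 0.8

lhs = expected_dropout_loss(w, x, y, keep_prob)
rhs = squared_loss(w, x, y) + explicit_regularizer(w, x, keep_prob)
gap = abs(lhs - rhs)  # should vanish up to floating-point error
```

The sketch covers only the change in the expected objective. The implicit effect described in the abstract comes from the noise of sampling one mask per update, which an exact expectation removes by construction.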
Related papers
- The Epochal Sawtooth Effect: Unveiling Training Loss Oscillations in Adam and Other Optimizers [8.770864706004472]
We identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Effect (ESE).
This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve.
We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect.
arXiv Detail & Related papers (2024-10-14T00:51:21Z)
- Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Diffusion models operate over a sequence of timesteps instead of instantaneous input-output relationships in previous contexts.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest.
arXiv Detail & Related papers (2024-01-17T07:58:18Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Stochastic Modified Equations and Dynamics of Dropout Algorithm [4.811269936680572]
Dropout is a widely utilized regularization technique in the training of neural networks.
Its underlying mechanism and its impact on achieving good abilities remain poorly understood.
arXiv Detail & Related papers (2023-05-25T08:42:25Z)
- Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
arXiv Detail & Related papers (2023-03-02T18:59:15Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Implicit regularization of dropout [3.42658286826597]
It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training.
In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments.
We find experimentally that training with dropout leads the neural network to a flatter minimum than standard gradient descent training does.
arXiv Detail & Related papers (2022-07-13T04:09:14Z)
- DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization [125.5448293005647]
We discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in offline deep RL.
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions.
We propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer.
arXiv Detail & Related papers (2021-12-09T06:01:01Z)