Analytic theory of dropout regularization
- URL: http://arxiv.org/abs/2505.07792v1
- Date: Mon, 12 May 2025 17:45:02 GMT
- Title: Analytic theory of dropout regularization
- Authors: Francesco Mori, Francesca Mignacco
- Abstract summary: Dropout is a regularization technique widely used in training artificial neural networks. We analytically study dropout in two-layer neural networks trained with online gradient descent.
- Score: 1.243080988483032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dropout is a regularization technique widely used in training artificial neural networks to mitigate overfitting. It consists of dynamically deactivating subsets of the network during training to promote more robust representations. Despite its widespread adoption, dropout probabilities are often selected heuristically, and theoretical explanations of its success remain sparse. Here, we analytically study dropout in two-layer neural networks trained with online stochastic gradient descent. In the high-dimensional limit, we derive a set of ordinary differential equations that fully characterize the evolution of the network during training and capture the effects of dropout. We obtain a number of exact results describing the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout reduces detrimental correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout probability increases with the level of noise in the data. Our results are validated by extensive numerical simulations.
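As a rough numerical illustration of the setting described in the abstract (a sketch under assumptions, not the authors' code: the teacher architecture, tanh activation, and parameter scalings below are illustrative choices), one can train a two-layer student with dropout on its hidden units by online stochastic gradient descent against a noisy teacher and track the generalization error of the full network:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 500           # input dimension (high-dimensional regime)
k = 4             # number of hidden units in the student network
lr = 0.1          # learning rate
p = 0.3           # dropout probability (illustrative value)
noise_std = 0.1   # label-noise level
steps = 50_000

# Teacher: a fixed single-unit network that generates noisy labels.
w_star = rng.standard_normal(d)

# Student: two-layer network; first layer trained, second layer fixed.
W = rng.standard_normal((k, d)) / np.sqrt(d)
v = np.ones(k) / k

phi = np.tanh                          # activation (illustrative choice)
dphi = lambda z: 1.0 - np.tanh(z) ** 2

gen_error = []
for t in range(steps):
    # Online SGD: a fresh Gaussian sample at every step.
    x = rng.standard_normal(d)
    y = phi(w_star @ x / np.sqrt(d)) + noise_std * rng.standard_normal()

    # Dropout on the hidden units, with inverted-dropout rescaling.
    mask = (rng.random(k) > p) / (1.0 - p)

    pre = W @ x / np.sqrt(d)           # hidden pre-activations
    out = (v * mask) @ phi(pre)        # prediction of the dropped-out network
    err = out - y

    # Gradient step on the first layer (only active units receive a signal).
    W -= lr * np.outer(err * v * mask * dphi(pre), x / np.sqrt(d))

    # Generalization error of the full (no-dropout) network on fresh data.
    if t % 5000 == 0:
        Xt = rng.standard_normal((2000, d))
        yt = phi(Xt @ w_star / np.sqrt(d))
        pred = phi(Xt @ W.T / np.sqrt(d)) @ v
        gen_error.append(0.5 * np.mean((pred - yt) ** 2))

print(gen_error)
```

Sweeping `p` and `noise_std` in such a simulation is one informal way to probe the reported trend that the optimal dropout probability increases with the level of label noise.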
Related papers
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks [59.552873049024775]
We show that compute-optimally trained models exhibit a remarkably precise universality. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws.
arXiv Detail & Related papers (2025-07-02T20:03:34Z)
- FairDropout: Using Example-Tied Dropout to Enhance Generalization of Minority Groups [10.274236106456758]
We show that models trained with empirical risk minimization tend to generalize well for examples from the majority groups while memorizing instances from minority groups. We apply example-tied dropout as a method we term FairDropout, aimed at redirecting this memorization to specific neurons that we subsequently drop out during inference. We empirically evaluate FairDropout using the subpopulation benchmark suite encompassing vision, language, and healthcare tasks, demonstrating that it significantly reduces reliance on spurious correlations, and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-02-10T17:18:54Z)
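As a loose illustration of the example-tied dropout idea summarized above (a sketch under assumptions, not the FairDropout implementation; the split sizes, per-example seeding, and unit counts are arbitrary), one can reserve a block of "memorization" units whose active subset is tied to the example index during training and which is dropped entirely at inference:

```python
import numpy as np

n_gen = 48           # "generalization" units: always active
n_mem = 16           # "memorization" units: example-tied in training, dropped at test
mem_per_example = 4  # memorization units activated per training example

def example_tied_mask(example_id, train=True):
    """Binary mask over all hidden units for one example.

    Generalization units are always kept.  During training, each example
    activates a fixed, index-dependent subset of the memorization units;
    at inference the whole memorization block is dropped.
    """
    mask = np.zeros(n_gen + n_mem)
    mask[:n_gen] = 1.0
    if train:
        sub_rng = np.random.default_rng(example_id)  # deterministic per example
        idx = sub_rng.choice(n_mem, size=mem_per_example, replace=False)
        mask[n_gen + idx] = 1.0
    return mask

# Hidden activations would be multiplied by the training mask during
# training and by the inference mask (memorization block zeroed) at test.
print(example_tied_mask(7, train=True))
print(example_tied_mask(7, train=False))
```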
- Y-Drop: A Conductance based Dropout for fully connected layers [63.029110722758496]
We introduce Y-Drop, a regularization method that biases the dropout algorithm towards dropping more important neurons with higher probability.
We show that forcing the network to solve the task at hand in the absence of its important units yields a strong regularization effect.
arXiv Detail & Related papers (2024-09-11T15:56:08Z)
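The conductance measure used by Y-Drop is not reproduced here; the sketch below only illustrates the general idea of biasing dropout toward more important units, taking an arbitrary non-negative importance score (for instance an |activation x gradient| proxy, which is an assumption on my part) and raising the drop probability of the units that score highest:

```python
import numpy as np

def importance_biased_mask(importance, base_rate=0.5, strength=0.5, rng=None):
    """Dropout mask whose per-unit drop probability grows with importance.

    `importance` is any non-negative per-unit score; more important units
    are dropped with higher probability, in the spirit of Y-Drop.
    """
    rng = rng or np.random.default_rng()
    share = importance / (importance.sum() + 1e-12)   # normalize scores
    drop_prob = np.clip(
        base_rate + strength * (share - share.mean()) * len(share), 0.0, 0.95
    )
    keep = rng.random(importance.shape) >= drop_prob
    return keep / (1.0 - drop_prob)                   # inverted-dropout rescaling

# Example: units 2 and 3 are deemed most important and are dropped most often.
scores = np.array([0.1, 0.2, 3.0, 2.5, 0.05])
print(importance_biased_mask(scores, rng=np.random.default_rng(0)))
```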
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Implicit regularization of dropout [3.42658286826597]
It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training.
In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments.
We experimentally find that training with dropout leads to a neural network with a flatter minimum than standard gradient descent training.
arXiv Detail & Related papers (2022-07-13T04:09:14Z)
- Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization goal helps the network achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
arXiv Detail & Related papers (2021-09-12T23:14:06Z)
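A minimal sketch of mixing the two loss terms, assuming the expectation loss is taken to be one minus the predicted probability of the correct class (the exact definition in the paper may differ) and treating the mixing weight `alpha` as a free parameter:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mixed_loss(logits, labels, alpha=0.5):
    """Convex mix of cross entropy and an expectation-style loss.

    alpha = 1 recovers plain cross entropy; alpha = 0 uses only the
    linear term 1 - p_correct, which weights easy and hard samples
    more evenly than the log term does.
    """
    p = softmax(logits)
    p_correct = p[np.arange(len(labels)), labels]
    ce = -np.log(p_correct + 1e-12)
    expectation = 1.0 - p_correct
    return np.mean(alpha * ce + (1.0 - alpha) * expectation)

logits = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]])
labels = np.array([0, 2])
print(mixed_loss(logits, labels, alpha=0.7))
```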
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting is ubiquitous in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free and easily implemented distribution with a parametric prior, and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
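The parametric prior of the advanced dropout technique is not reproduced here; the sketch below only shows one generic way to make the dropout rate learnable, via a concrete (Gumbel-sigmoid) relaxation with the rate parameterized through a sigmoid, which is a different construction from the paper's:

```python
import numpy as np

def relaxed_dropout_mask(theta, shape, temperature=0.1, rng=None):
    """Differentiable surrogate for a dropout mask with a learnable rate.

    The dropout rate is sigmoid(theta); the mask is a concrete (Gumbel-sigmoid)
    relaxation, so theta can in principle be adapted by gradient descent
    alongside the network weights.  This is a generic construction, not the
    specific parametric prior used in the Advanced Dropout paper.
    """
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(-theta))                  # current dropout rate
    u = rng.uniform(1e-6, 1 - 1e-6, size=shape)       # uniform noise
    logit = (np.log(u) - np.log(1 - u) + np.log(1 - p) - np.log(p)) / temperature
    keep = 1.0 / (1.0 + np.exp(-logit))               # soft keep-mask in (0, 1)
    return keep / (1.0 - p + 1e-12)                   # inverted-dropout rescaling

# theta = 0 corresponds to a dropout rate of 0.5.
print(relaxed_dropout_mask(theta=0.0, shape=5, rng=np.random.default_rng(0)))
```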
- Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance [0.0]
In general, neural networks are trained by gradient-type optimization methods.
The loss function decreases rapidly at the beginning of training but then, after a relatively small number of steps, its decrease slows down significantly.
The present work aims to identify and quantify the root causes of the plateau phenomenon.
arXiv Detail & Related papers (2020-07-14T17:33:26Z)
- Regularizing Class-wise Predictions via Self-knowledge Distillation [80.76254453115766]
We propose a new regularization method that penalizes the predictive distribution between similar samples.
This results in regularizing the dark knowledge (i.e., the knowledge on wrong predictions) of a single network.
Our experimental results on various image classification tasks demonstrate that this simple yet powerful method can significantly improve generalization ability.
arXiv Detail & Related papers (2020-03-31T06:03:51Z)
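A minimal sketch of a class-wise regularizer in this spirit, assuming it amounts to penalizing the KL divergence between the softened predictive distributions of two samples that share a class label (the temperature and the sample pairing are assumptions here):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classwise_selfkd_penalty(logits_a, logits_b, temperature=4.0):
    """KL(p_b || p_a) between the softened predictions of two samples that
    share the same class label, used as an auxiliary penalty on top of the
    usual cross-entropy loss."""
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    return np.sum(p_b * (np.log(p_b + 1e-12) - np.log(p_a + 1e-12)), axis=-1).mean()

# Two samples from the same class with slightly different predictions.
la = np.array([[3.0, 0.5, -1.0]])
lb = np.array([[2.5, 1.0, -0.5]])
print(classwise_selfkd_penalty(la, lb))
```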
- The Implicit and Explicit Regularization Effects of Dropout [43.431343291010734]
Dropout is a widely used regularization technique, often required to obtain state-of-the-art performance for a number of architectures.
This work demonstrates that dropout introduces two distinct but entangled regularization effects.
arXiv Detail & Related papers (2020-02-28T18:31:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.