Dropout Reduces Underfitting
- URL: http://arxiv.org/abs/2303.01500v2
- Date: Wed, 31 May 2023 17:47:18 GMT
- Title: Dropout Reduces Underfitting
- Authors: Zhuang Liu, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, Trevor Darrell
- Abstract summary: In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
- Score: 85.61466286688385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Introduced by Hinton et al. in 2012, dropout has stood the test of time as a
regularizer for preventing overfitting in neural networks. In this study, we
demonstrate that dropout can also mitigate underfitting when used at the start
of training. During the early phase, we find dropout reduces the directional
variance of gradients across mini-batches and helps align the mini-batch
gradients with the entire dataset's gradient. This helps counteract the
stochasticity of SGD and limit the influence of individual batches on model
training. Our findings lead us to a solution for improving performance in
underfitting models - early dropout: dropout is applied only during the initial
phases of training, and turned off afterwards. Models equipped with early
dropout achieve lower final training loss compared to their counterparts
without dropout. Additionally, we explore a symmetric technique for
regularizing overfitting models - late dropout, where dropout is not used in
the early iterations and is only activated later in training. Experiments on
ImageNet and various vision tasks demonstrate that our methods consistently
improve generalization accuracy. Our results encourage more research on
understanding regularization in deep learning and our methods can be useful
tools for future neural network training, especially in the era of large data.
Code is available at https://github.com/facebookresearch/dropout.
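Because the recipe is simple to state, here is a minimal PyTorch sketch of early dropout as described in the abstract; the toy model, the epoch cutoff of 10, and the random stand-in batches are illustrative assumptions rather than the authors' exact setup (their implementation is in the linked repository).

```python
import torch
import torch.nn as nn

# Toy model whose dropout layers we can switch on and off during training.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # active only during the early phase
    nn.Linear(64, 10),
)

def set_dropout(model: nn.Module, p: float) -> None:
    """Set the drop probability of every Dropout module in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
early_dropout_epochs = 10  # assumed cutoff; the paper tunes this per setting

for epoch in range(100):
    # Early dropout: keep dropout on for the first epochs, then disable it.
    # (Late dropout would be the mirror image: start at p=0.0, raise p later.)
    if epoch == early_dropout_epochs:
        set_dropout(model, 0.0)
    x = torch.randn(16, 32)             # stand-in mini-batch
    y = torch.randint(0, 10, (16,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```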
Related papers
- Instance-dependent Early Stopping [57.912273923450726]
We propose an Instance-dependent Early Stopping (IES) method that adapts the early stopping mechanism from the entire training set to the instance level.
IES considers an instance as mastered if the second-order differences of its loss value remain within a small range around zero.
IES can reduce backpropagation instances by 10%-50% while maintaining or even slightly improving the test accuracy and transfer learning performance of a model.
arXiv Detail & Related papers (2025-02-11T13:34:09Z)
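A rough sketch of the instance-level criterion summarized above: per-sample losses are tracked across epochs, and a sample is treated as mastered once its second-order loss differences stay near zero. The epsilon, patience, and bookkeeping below are assumptions for illustration, not the authors' exact procedure.

```python
import torch

class InstanceEarlyStopper:
    """Track per-sample losses and flag 'mastered' instances whose
    second-order loss differences stay near zero (illustrative thresholds)."""

    def __init__(self, num_samples: int, eps: float = 1e-3, patience: int = 3):
        self.history = [[] for _ in range(num_samples)]
        self.calm_steps = torch.zeros(num_samples, dtype=torch.long)
        self.mastered = torch.zeros(num_samples, dtype=torch.bool)
        self.eps, self.patience = eps, patience

    def update(self, indices: torch.Tensor, losses: torch.Tensor) -> None:
        for idx, loss in zip(indices.tolist(), losses.tolist()):
            h = self.history[idx]
            h.append(loss)
            if len(h) >= 3:
                # Discrete second-order difference of the loss trajectory.
                second_diff = h[-1] - 2 * h[-2] + h[-3]
                self.calm_steps[idx] = self.calm_steps[idx] + 1 if abs(second_diff) < self.eps else 0
                if self.calm_steps[idx] >= self.patience:
                    self.mastered[idx] = True

    def active_mask(self, indices: torch.Tensor) -> torch.Tensor:
        """Boolean mask of samples that should still contribute to backprop."""
        return ~self.mastered[indices]

# Per-batch usage (per-sample losses computed with reduction='none'):
stopper = InstanceEarlyStopper(num_samples=50000)
indices = torch.arange(8)
losses = torch.rand(8)
stopper.update(indices, losses)
keep = stopper.active_mask(indices)   # back-propagate only losses[keep]
```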
- Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting [15.251425165987987]
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities.
We propose a sample weighting scheme for the fine-tuning data based on the pre-trained model's losses.
We empirically demonstrate the efficacy of our method on both language and vision tasks.
arXiv Detail & Related papers (2025-02-05T00:49:59Z)
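To make the loss-based weighting idea above concrete, here is a hedged sketch in which fine-tuning samples are weighted by a softmax over the negative per-sample losses of the frozen pre-trained model, so easy (low-loss) samples are upweighted; the softmax form and temperature are assumptions, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def easy_sample_weights(pretrained_losses: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Map the pre-trained model's per-sample losses to weights that favour
    easy (low-loss) samples; rescaled so the average weight is roughly 1."""
    return F.softmax(-pretrained_losses / temperature, dim=0) * pretrained_losses.numel()

# Example: per-sample losses from the frozen pre-trained model (two easy, two hard).
pretrained_losses = torch.tensor([0.1, 0.2, 2.5, 3.0])
weights = easy_sample_weights(pretrained_losses)

# Weighted fine-tuning loss for the same batch under the new model.
finetune_losses = torch.tensor([0.9, 1.1, 0.8, 1.2])
loss = (weights * finetune_losses).mean()
```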
- Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data.
We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated.
We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
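The ensemble described above is built from a single network's training history; below is a minimal sketch that simply averages softmax predictions over periodically saved checkpoints. The snapshot schedule and uniform averaging are illustrative assumptions and not the authors' exact selection method.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(32, 10)          # stand-in network
checkpoints = []                   # snapshots collected during one training run

for epoch in range(30):
    # ... one epoch of normal training would go here ...
    if epoch % 5 == 0:             # assumed snapshot schedule
        checkpoints.append(copy.deepcopy(model.state_dict()))

@torch.no_grad()
def history_ensemble_predict(x: torch.Tensor) -> torch.Tensor:
    """Average softmax outputs over checkpoints from a single training run."""
    probs = torch.zeros(x.shape[0], 10)
    for state in checkpoints:
        model.load_state_dict(state)
        probs += F.softmax(model(x), dim=-1)
    return probs / len(checkpoints)

preds = history_ensemble_predict(torch.randn(4, 32)).argmax(dim=-1)
```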
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
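To illustrate the weight-sharing structure mentioned above, here is a toy slimmable linear layer whose narrower sub-networks reuse the leading slice of the full layer's weights; the width multipliers are arbitrary choices and the contrastive pre-training loop is omitted.

```python
import torch
import torch.nn as nn

class SlimmableLinear(nn.Module):
    """A full linear layer whose narrower sub-networks share (slice) its weights."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, width_mult: float = 1.0) -> torch.Tensor:
        out = int(self.weight.shape[0] * width_mult)   # keep the first `out` units
        return x @ self.weight[:out].t() + self.bias[:out]

layer = SlimmableLinear(64, 128)
x = torch.randn(8, 64)
full = layer(x, width_mult=1.0)    # 128-d features from the full network
half = layer(x, width_mult=0.5)    # 64-d features from a weight-sharing sub-network
```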
- Implicit regularization of dropout [3.42658286826597]
It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training.
In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments.
We experimentally find that training with dropout leads the neural network to a flatter minimum than standard gradient descent training.
arXiv Detail & Related papers (2022-07-13T04:09:14Z)
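The finding above concerns the flatness of the minima that dropout training reaches; as a crude concrete proxy, the sketch below estimates sharpness as the average loss increase under small random weight perturbations, so a dropout-trained model would be expected to show a smaller increase. The perturbation scale and trial count are arbitrary assumptions, and this is not the paper's theoretical analysis.

```python
import copy

import torch
import torch.nn as nn

@torch.no_grad()
def sharpness_probe(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                    sigma: float = 0.01, trials: int = 10) -> float:
    """Average loss increase under small random weight perturbations;
    smaller values indicate a flatter minimum (illustrative proxy only)."""
    criterion = nn.CrossEntropyLoss()
    base = criterion(model(x), y).item()
    increases = []
    for _ in range(trials):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
        increases.append(criterion(noisy(x), y).item() - base)
    return sum(increases) / trials

# Compare a dropout-trained and a plain-SGD-trained model on the same batch.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))
print(sharpness_probe(model, torch.randn(64, 32), torch.randint(0, 10, (64,))))
```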
- Neuron-Specific Dropout: A Deterministic Regularization Technique to Prevent Neural Networks from Overfitting & Reduce Dependence on Large Training Samples [0.0]
NSDropout examines both the training pass and the validation pass of a layer in a model.
By comparing the average values produced by each neuron for each class in a data set, the network is able to drop targeted units.
The layer can thus identify which features, or noise, the model attends to during testing that are not present in the validation samples.
arXiv Detail & Related papers (2022-01-13T13:10:30Z)
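A heavily hedged sketch of the idea above: compare each unit's per-class mean activation on training versus validation batches and zero out the most mismatched units. The aggregation over classes, the top-k selection, and all thresholds are assumptions made for illustration; the paper's exact procedure may differ.

```python
import torch

def nsdropout_mask(train_acts: torch.Tensor, train_labels: torch.Tensor,
                   val_acts: torch.Tensor, val_labels: torch.Tensor,
                   num_classes: int, drop_k: int = 8) -> torch.Tensor:
    """Compare each unit's per-class mean activation on training vs. validation
    data and drop (zero) the units with the largest mismatch."""
    gap = torch.zeros(train_acts.shape[1])
    for c in range(num_classes):
        t_sel = train_acts[train_labels == c]
        v_sel = val_acts[val_labels == c]
        if len(t_sel) and len(v_sel):
            gap += (t_sel.mean(dim=0) - v_sel.mean(dim=0)).abs()
    mask = torch.ones(train_acts.shape[1])
    mask[gap.topk(drop_k).indices] = 0.0   # deterministic, targeted drop
    return mask                            # multiply the layer's activations by this

# Example with random activations from a 64-unit layer and 10 classes.
mask = nsdropout_mask(torch.randn(256, 64), torch.randint(0, 10, (256,)),
                      torch.randn(128, 64), torch.randint(0, 10, (128,)),
                      num_classes=10)
```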
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting is ubiquitous in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free, easily implemented distribution with a parametric prior and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
- Do We Need Zero Training Loss After Achieving Zero Training Error? [76.44358201918156]
We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
arXiv Detail & Related papers (2020-02-20T12:50:49Z)
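The flooding objective above has a one-line form, |loss - b| + b for a flood level b, which keeps the training loss hovering around b instead of driving it to zero; the sketch below applies it to a stand-in batch (the flood level of 0.05 is an arbitrary choice, not a recommended value).

```python
import torch
import torch.nn as nn

flood_level = 0.05                 # 'b', the flood level hyperparameter
criterion = nn.CrossEntropyLoss()

logits = torch.randn(16, 10, requires_grad=True)   # stand-in model outputs
targets = torch.randint(0, 10, (16,))

loss = criterion(logits, targets)
# Flooding: once the loss falls below b, the sign flips and gradient ascent
# kicks in, so the training loss floats around b instead of going to zero.
flooded_loss = (loss - flood_level).abs() + flood_level
flooded_loss.backward()
```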