Do We Need Zero Training Loss After Achieving Zero Training Error?
- URL: http://arxiv.org/abs/2002.08709v2
- Date: Wed, 31 Mar 2021 07:22:24 GMT
- Title: Do We Need Zero Training Loss After Achieving Zero Training Error?
- Authors: Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi
Sugiyama
- Abstract summary: We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
- Score: 76.44358201918156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparameterized deep networks have the capacity to memorize training data
with zero \emph{training error}. Even after memorization, the \emph{training
loss} continues to approach zero, making the model overconfident and the test
performance degraded. Since existing regularizers do not directly aim to avoid
zero training loss, it is hard to tune their hyperparameters in order to
maintain a fixed/preset level of training loss. We propose a direct solution
called \emph{flooding} that intentionally prevents further reduction of the
training loss when it reaches a reasonably small value, which we call the
\emph{flood level}. Our approach makes the loss float around the flood level by
doing mini-batched gradient descent as usual but gradient ascent if the
training loss is below the flood level. This can be implemented with one line
of code and is compatible with any stochastic optimizer and other regularizers.
With flooding, the model will continue to "random walk" with the same non-zero
training loss, and we expect it to drift into an area with a flat loss
landscape that leads to better generalization. We experimentally show that
flooding improves performance and, as a byproduct, induces a double descent
curve of the test loss.
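The "one line of code" mentioned in the abstract amounts to replacing the training loss J with |J - b| + b, where b is the flood level. Below is a minimal PyTorch-style sketch of a training step under this flooded objective; the placeholder model, optimizer, and flood level b = 0.05 are assumptions for illustration, not settings from the paper.

```python
import torch
import torch.nn as nn

def flooded_loss(loss: torch.Tensor, flood_level: float) -> torch.Tensor:
    # The "one line": gradient descent while loss > b, gradient ascent once loss < b.
    return (loss - flood_level).abs() + flood_level

model = nn.Linear(10, 2)                                   # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(x: torch.Tensor, y: torch.Tensor, b: float = 0.05) -> float:
    optimizer.zero_grad()
    loss = criterion(model(x), y)                          # ordinary training loss
    flooded_loss(loss, b).backward()                       # flooding applied here
    optimizer.step()
    return loss.item()
```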
Related papers
- Careful with that Scalpel: Improving Gradient Surgery with an EMA [32.73961859864032]
We show how one can improve performance by blending the gradients beyond a simple sum.
We demonstrate that our method, Bloop, can lead to much better performance on NLP and vision experiments.
arXiv Detail & Related papers (2024-02-05T13:37:00Z) - Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
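As a rough illustration of the early-dropout schedule summarized above, the following PyTorch-style sketch keeps dropout active for an initial phase of training and then disables it; the architecture, drop probability, and epoch cutoff are assumptions, not values from the paper.

```python
import torch.nn as nn

# Sketch of "early dropout": dropout is active only during the initial
# phase of training and is switched off for the remaining epochs.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),       # dropout to be disabled after the early phase
    nn.Linear(256, 10),
)

def set_dropout(module: nn.Module, p: float) -> None:
    # Set the drop probability of every Dropout layer in the model.
    for m in module.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

early_epochs = 20            # assumed length of the "early" phase
for epoch in range(100):
    if epoch == early_epochs:
        set_dropout(model, 0.0)   # turn dropout off for the rest of training
    # ... run one epoch of standard mini-batch training here ...
```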
arXiv Detail & Related papers (2023-03-02T18:59:15Z) - Balance is Essence: Accelerating Sparse Training via Adaptive Gradient
Correction [29.61757744974324]
Deep neural networks incur significant memory and computation costs.
Sparse training is one of the most common techniques to reduce these costs.
In this work, we aim to overcome the shortcomings of existing sparse training methods and achieve space-time co-efficiency.
arXiv Detail & Related papers (2023-01-09T18:50:03Z) - Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization objective helps the network achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
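One way to read the summary above is as a combination of the cross-entropy loss with an expectation-style loss term. The sketch below uses 1 minus the softmax probability of the true class as the expectation term; this specific form and the mixing weight alpha are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Sketch of mixing a cross-entropy term with an expectation-style loss term.
# The expectation term here is 1 - p(true class); `alpha` is a hypothetical
# mixing hyperparameter.
def mixed_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)                       # focuses on hard samples
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the correct class
    expectation = (1.0 - p_true).mean()                         # expectation-style term
    return alpha * ce + (1.0 - alpha) * expectation
```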
arXiv Detail & Related papers (2021-09-12T23:14:06Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - Over-parameterized Adversarial Training: An Analysis Overcoming the
Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width instead of exponential, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.