Do We Need Zero Training Loss After Achieving Zero Training Error?
- URL: http://arxiv.org/abs/2002.08709v2
- Date: Wed, 31 Mar 2021 07:22:24 GMT
- Title: Do We Need Zero Training Loss After Achieving Zero Training Error?
- Authors: Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi
Sugiyama
- Abstract summary: We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
- Score: 76.44358201918156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparameterized deep networks have the capacity to memorize training data
with zero \emph{training error}. Even after memorization, the \emph{training
loss} continues to approach zero, making the model overconfident and the test
performance degraded. Since existing regularizers do not directly aim to avoid
zero training loss, it is hard to tune their hyperparameters in order to
maintain a fixed/preset level of training loss. We propose a direct solution
called \emph{flooding} that intentionally prevents further reduction of the
training loss when it reaches a reasonably small value, which we call the
\emph{flood level}. Our approach makes the loss float around the flood level by
doing mini-batched gradient descent as usual but gradient ascent if the
training loss is below the flood level. This can be implemented with one line
of code and is compatible with any stochastic optimizer and other regularizers.
With flooding, the model will continue to "random walk" with the same non-zero
training loss, and we expect it to drift into an area with a flat loss
landscape that leads to better generalization. We experimentally show that
flooding improves performance and, as a byproduct, induces a double descent
curve of the test loss.
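The "one line of code" mentioned in the abstract amounts to replacing the training loss J with |J - b| + b, where b is the flood level. Below is a minimal PyTorch-style sketch of a training step under this flooded objective; the placeholder model, optimizer, and flood level b = 0.05 are assumptions for illustration, not settings from the paper.

```python
import torch
import torch.nn as nn

def flooded_loss(loss: torch.Tensor, flood_level: float) -> torch.Tensor:
    # The "one line": gradient descent while loss > b, gradient ascent once loss < b.
    return (loss - flood_level).abs() + flood_level

model = nn.Linear(10, 2)                                   # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(x: torch.Tensor, y: torch.Tensor, b: float = 0.05) -> float:
    optimizer.zero_grad()
    loss = criterion(model(x), y)                          # ordinary training loss
    flooded_loss(loss, b).backward()                       # flooding applied here
    optimizer.step()
    return loss.item()
```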
Related papers
- Careful with that Scalpel: Improving Gradient Surgery with an EMA [32.73961859864032]
We show how one can improve performance by blending the gradients beyond a simple sum.
We demonstrate that our method, Bloop, can lead to much better performance on NLP and vision experiments.
arXiv Detail & Related papers (2024-02-05T13:37:00Z) - Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
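As a rough illustration of the early-dropout schedule summarized above, the following PyTorch-style sketch keeps dropout active for an initial phase of training and then disables it; the architecture, drop probability, and epoch cutoff are assumptions, not values from the paper.

```python
import torch.nn as nn

# Sketch of "early dropout": dropout is active only during the initial
# phase of training and is switched off for the remaining epochs.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),       # dropout to be disabled after the early phase
    nn.Linear(256, 10),
)

def set_dropout(module: nn.Module, p: float) -> None:
    # Set the drop probability of every Dropout layer in the model.
    for m in module.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

early_epochs = 20            # assumed length of the "early" phase
for epoch in range(100):
    if epoch == early_epochs:
        set_dropout(model, 0.0)   # turn dropout off for the rest of training
    # ... run one epoch of standard mini-batch training here ...
```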
arXiv Detail & Related papers (2023-03-02T18:59:15Z) - Balance is Essence: Accelerating Sparse Training via Adaptive Gradient
Correction [29.61757744974324]
Deep neural networks incur significant memory and computation costs.
Sparse training is one of the most common techniques to reduce these costs.
In this work, we aim to overcome the shortcomings of existing sparse training methods and achieve space-time co-efficiency.
arXiv Detail & Related papers (2023-01-09T18:50:03Z) - Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization objective helps the network achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
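One way to read the summary above is as a combination of the cross-entropy loss with an expectation-style loss term. The sketch below uses 1 minus the softmax probability of the true class as the expectation term; this specific form and the mixing weight alpha are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Sketch of mixing a cross-entropy term with an expectation-style loss term.
# The expectation term here is 1 - p(true class); `alpha` is a hypothetical
# mixing hyperparameter.
def mixed_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)                       # focuses on hard samples
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the correct class
    expectation = (1.0 - p_true).mean()                         # expectation-style term
    return alpha * ce + (1.0 - alpha) * expectation
```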
arXiv Detail & Related papers (2021-09-12T23:14:06Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - Over-parameterized Adversarial Training: An Analysis Overcoming the
Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width instead of exponential, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.