Loss Spike in Training Neural Networks
- URL: http://arxiv.org/abs/2305.12133v1
- Date: Sat, 20 May 2023 07:57:15 GMT
- Title: Loss Spike in Training Neural Networks
- Authors: Zhongwang Zhang, Zhi-Qin John Xu
- Abstract summary: We study the mechanism underlying loss spikes observed during neural network training.
In this work, we revisit the link between $\lambda_{\mathrm{max}}$ flatness and generalization.
- Score: 3.42658286826597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study the mechanism underlying loss spikes observed during
neural network training. When training enters a region with a
smaller-loss-as-sharper (SLAS) structure, it becomes unstable: once the
landscape is too sharp, the loss increases exponentially, i.e., the rapid
ascent of the loss spike. Training becomes stable again once it finds a flat
region. The deviation along the first eigen direction (the direction of the
maximum eigenvalue $\lambda_{\mathrm{max}}$ of the loss Hessian) is found to be
dominated by low-frequency components. Since low-frequency components are
captured very fast (the frequency principle), the rapid
the link between $\lambda_{\mathrm{max}}$ flatness and generalization. For real
datasets, the low-frequency component is often dominant and is captured well
from both the training data and the test data. A solution with good
generalization and a solution with bad generalization can then both learn the
low-frequency component well, so they differ little along the sharpest
direction. Therefore, although
$\lambda_{\mathrm{max}}$ can indicate the sharpness of the loss landscape,
deviation in its corresponding eigen direction is not responsible for the
generalization difference. We also find that loss spikes can facilitate
condensation, i.e., the input weights of different neurons evolve towards the
same direction, which may be the underlying mechanism by which loss spikes
improve generalization, rather
than simply controlling the value of $\lambda_{\mathrm{max}}$.
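To make the mechanism concrete, here is a minimal PyTorch sketch (not the authors' code; names and the threshold heuristic are my own) that estimates $\lambda_{\mathrm{max}}$ by power iteration on Hessian-vector products and measures condensation as pairwise cosine similarity of a layer's input weight vectors. For full-batch gradient descent with step size $\eta$, a local quadratic approximation predicts instability, and hence a spike's rapid ascent, once $\lambda_{\mathrm{max}} > 2/\eta$.

```python
import torch

def estimate_lambda_max(loss, params, iters=20):
    """Power iteration for the largest Hessian eigenvalue, using
    Hessian-vector products obtained by differentiating the gradient."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        g_dot_v = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(g_dot_v, params, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig

def condensation_matrix(weight):
    """Pairwise cosine similarity of a layer's input weight vectors;
    entries near +/-1 indicate neurons condensing onto shared directions."""
    w = weight / weight.norm(dim=1, keepdim=True)
    return w @ w.t()

# Heuristic stability check for full-batch gradient descent with step eta:
# under a local quadratic approximation, the iterates diverge along the top
# eigen direction once lambda_max exceeds 2 / eta.
# lam = estimate_lambda_max(loss, list(model.parameters()))
# unstable = lam > 2.0 / eta
```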
Related papers
- Noisy Interpolation Learning with Shallow Univariate ReLU Networks [33.900009202637285]
Mallinar et al. (2022) noted that neural networks often seem to exhibit "tempered overfitting", wherein the population risk does not converge to the Bayes-optimal error.
We provide the first rigorous analysis of the overfitting behavior of regression with minimum-norm weights.
arXiv Detail & Related papers (2023-07-28T08:41:12Z) - Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression [16.551664358490658]
In deep learning, the training process often finds an interpolator (a solution with zero training loss), yet the test loss is still low.
One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator.
We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss.
arXiv Detail & Related papers (2023-02-01T05:41:41Z) - Distribution Mismatch Correction for Improved Robustness in Deep Neural Networks [86.42889611784855]
Normalization methods increase a network's vulnerability to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
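The summary does not say how the correction is computed. As one plausible instantiation of unsupervised, non-parametric, per-layer distribution matching (an assumption on my part, not the paper's procedure), the sketch below maps each test-time activation to the clean training-time value at the same empirical quantile:

```python
import torch

class ActivationQuantileMatcher:
    """Hypothetical per-layer corrector: piecewise-linear map from the
    quantiles of the current (possibly corrupted) activations to the
    quantiles recorded on clean training data."""

    def __init__(self, clean_activations, n_quantiles=256):
        self.q = torch.linspace(0.0, 1.0, n_quantiles)
        self.ref = torch.quantile(clean_activations.flatten().float(), self.q)

    def __call__(self, acts):
        flat = acts.flatten().float()
        src = torch.quantile(flat, self.q)               # current quantiles
        idx = torch.searchsorted(src, flat).clamp(1, len(src) - 1)
        lo, hi = src[idx - 1], src[idx]
        frac = (flat - lo) / (hi - lo + 1e-12)           # position within bin
        out = self.ref[idx - 1] + frac * (self.ref[idx] - self.ref[idx - 1])
        return out.view_as(acts)
```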
arXiv Detail & Related papers (2021-10-05T11:36:25Z) - Learning with Noisy Labels via Sparse Regularization [76.31104997491695]
Learning with noisy labels is an important task for training accurate deep neural networks.
Some commonly-used loss functions, such as Cross Entropy (CE), suffer from severe overfitting to noisy labels.
We introduce the sparse regularization strategy to approximate the one-hot constraint.
arXiv Detail & Related papers (2021-07-31T09:40:23Z) - Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
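For concreteness, one way to compute the statistic this summary refers to (the paper's exact curve and normalization may differ) is shown below in NumPy: per-example multiclass margins, then the area under the sorted-margin curve.

```python
import numpy as np

def multiclass_margins(logits, labels):
    """Margin of each example: true-class score minus the best other score.
    logits: (n, k) float array; labels: (n,) int array."""
    n = len(labels)
    true_scores = logits[np.arange(n), labels]
    rest = logits.copy()
    rest[np.arange(n), labels] = -np.inf
    return true_scores - rest.max(axis=1)

def margin_curve_area(logits, labels):
    # One plausible reading of "area under the curve of the margin
    # distribution": sort the per-example margins and integrate the
    # resulting curve over the normalized example index (trapezoid rule).
    m = np.sort(multiclass_margins(logits, labels))
    return ((m[:-1] + m[1:]) / 2.0).mean()  # trapezoidal area over [0, 1]
```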
arXiv Detail & Related papers (2021-07-21T16:41:57Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Implicitly Maximizing Margins with the Hinge Loss [0.0]
We show that for a linear classifier on linearly separable data with a fixed step size, the margin of this modified hinge loss converges to the $\ell_2$ max-margin at a rate of $\mathcal{O}(1/t)$.
Empirical results suggest that this increased speed carries over to ReLU networks.
arXiv Detail & Related papers (2020-06-25T10:04:16Z) - Flatness is a False Friend [0.7614628596146599]
Hessian-based measures of flatness have been argued for, used, and shown to relate to generalisation.
We show that for feedforward neural networks under the cross-entropy loss, we would expect low-loss solutions with large weights to have small Hessian-based measures of flatness.
arXiv Detail & Related papers (2020-06-16T11:55:24Z) - Do We Need Zero Training Loss After Achieving Zero Training Error? [76.44358201918156]
We propose a direct solution called "flooding" that intentionally prevents further reduction of the training loss when it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
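Flooding itself is a one-line change to the objective: with flood level $b$, train on $|L - b| + b$, which behaves like the plain loss while $L > b$ and reverses the gradient once $L < b$. A minimal PyTorch version (the flood-level value here is illustrative):

```python
import torch.nn.functional as F

def flooded_loss(logits, targets, flood_level=0.05):
    # |loss - b| + b: identical gradient to the plain loss above the flood
    # level, reversed gradient below it, so the training loss floats near b.
    loss = F.cross_entropy(logits, targets)
    return (loss - flood_level).abs() + flood_level
```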
arXiv Detail & Related papers (2020-02-20T12:50:49Z) - Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width, rather than exponential width, under natural assumptions and with the ReLU activation.
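For reference, the standard PGD-based adversarial training loop that such analyses study (in the style of Madry et al.; the hyperparameters below are illustrative and this is not the paper's code) looks roughly like:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: ascend the loss within an eps-ball around x."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def adversarial_train_step(model, optimizer, x, y):
    """Outer minimization: one optimizer step on the perturbed batch."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```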
arXiv Detail & Related papers (2020-02-16T20:13:43Z) - The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
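This implicit bias is easy to observe numerically. In the NumPy sketch below (data and hyperparameters invented for illustration), gradient descent on the logistic loss over separable data keeps increasing the normalized minimum margin long after the training error reaches zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, 2)) + 3.0 * y[:, None]   # two well-separated clusters

w, lr = np.zeros(2), 0.5
for t in range(1, 50001):
    m = y * (X @ w)                               # per-example margins
    coef = 1.0 / (1.0 + np.exp(np.clip(m, -60.0, 60.0)))  # logistic weights
    w += lr * (coef[:, None] * y[:, None] * X).mean(axis=0)
    if t in (10, 100, 1000, 10000, 50000):
        # ||w|| keeps growing; the normalized minimum margin keeps improving.
        mm = y * (X @ w)
        print(t, round(np.linalg.norm(w), 2),
              round(mm.min() / np.linalg.norm(w), 4))
```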
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.