Loss Spike in Training Neural Networks
- URL: http://arxiv.org/abs/2305.12133v1
- Date: Sat, 20 May 2023 07:57:15 GMT
- Title: Loss Spike in Training Neural Networks
- Authors: Zhongwang Zhang, Zhi-Qin John Xu
- Abstract summary: We study the mechanism underlying loss spikes observed during neural network training.
In this work, we revisit the link between $\lambda_{\mathrm{max}}$ flatness and generalization.
- Score: 3.42658286826597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study the mechanism underlying the loss spikes observed during
neural network training. When training enters a region with a
smaller-loss-as-sharper (SLAS) structure, it becomes unstable and the loss
increases exponentially once the landscape is too sharp; this is the rapid
ascent of the loss spike. Training stabilizes again once it finds a flat
region. The deviation along the first eigendirection (the one with the maximum
eigenvalue $\lambda_{\mathrm{max}}$ of the loss Hessian) is found to be
dominated by low-frequency components. Since low frequencies are captured very
quickly (the frequency principle), the rapid descent then follows. Inspired by
this analysis of loss spikes, we revisit the link between
$\lambda_{\mathrm{max}}$ flatness and generalization. For real datasets,
low-frequency components are often dominant and are captured well by both the
training data and the test data. A solution with good generalization and a
solution with bad generalization can therefore both learn the low frequencies
well, so the two differ little along the sharpest direction. Hence, although
$\lambda_{\mathrm{max}}$ can indicate the sharpness of the loss landscape,
deviation along its corresponding eigendirection is not responsible for the
difference in generalization. We also find that loss spikes can facilitate
condensation, i.e., input weights of different neurons evolve toward the same
direction, which may be the underlying mechanism by which loss spikes improve
generalization, rather than simply controlling the value of $\lambda_{\mathrm{max}}$.
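As a concrete companion to the abstract, here is a minimal PyTorch sketch (not the authors' code; the function name and iteration count are illustrative) of the standard way to track $\lambda_{\mathrm{max}}$ during training: power iteration on Hessian-vector products, which never forms the Hessian explicitly.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate lambda_max of the loss Hessian via power iteration on
    Hessian-vector products; assumes the top eigenvalue dominates in
    magnitude, which is typical in sharp regions of the landscape."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product: differentiate <grads, v> w.r.t. params.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient <v, Hv> (v is unit-norm) estimates lambda_max.
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```

Calling this with a freshly computed scalar loss and `list(model.parameters())` gives a cheap sharpness probe of the kind the analysis above relies on.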
Related papers
- A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation [12.321507997896218]
We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$.
arXiv Detail & Related papers (2025-05-26T16:12:45Z)
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective [66.80315289020487]
The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget.
We show that the pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom.
Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch (a schedule sketch follows this entry).
arXiv Detail & Related papers (2024-10-07T16:49:39Z)
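A minimal sketch of the WSD shape described above, assuming linear warmup and a linear decay branch; every step count and learning-rate value here is illustrative rather than taken from the paper.

```python
def wsd_lr(step, peak_lr=3e-4, warmup_steps=1000,
           decay_start=None, decay_steps=2000, min_lr=3e-5):
    """Warmup-Stable-Decay: warm up, then hold a constant rate for an
    open-ended stable phase; decay runs only once a branch is forked
    by setting decay_start."""
    if step < warmup_steps:                        # warmup phase
        return peak_lr * step / warmup_steps
    if decay_start is None or step < decay_start:  # stable phase
        return peak_lr
    t = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + t * (min_lr - peak_lr)        # decay branch
```

Leaving `decay_start` unset keeps the main branch running indefinitely, which is the property that lets WSD avoid a pre-specified compute budget.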
- Astral: training physics-informed neural networks with error majorants [45.24347017854392]
We argue that the residual is, at best, an indirect measure of the error of an approximate solution.
Since an error majorant provides a direct upper bound on the error, one can reliably estimate how close a PiNN is to the exact solution.
arXiv Detail & Related papers (2024-06-04T13:11:49Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Optimal learning rate schedules in high-dimensional non-convex optimization problems [14.058580956992051]
Learning rate schedules are ubiquitously used to speed up and improve optimisation.
We present a first analytical study of the role of learning rate scheduling in this setting.
arXiv Detail & Related papers (2022-02-09T15:15:39Z)
- A variance principle explains why dropout finds flatter minima [0.0]
We show that training with dropout finds a neural network with a flatter minimum compared with standard gradient descent training.
We propose a Variance Principle: the variance of the noise is larger along the sharper directions of the loss landscape.
arXiv Detail & Related papers (2021-11-01T15:26:19Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization [111.57403811375484]
We show that gradient descent implicitly penalizes the trace of the Fisher Information Matrix from the beginning of training.
We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training (an estimator sketch follows this entry).
arXiv Detail & Related papers (2020-12-28T11:17:46Z)
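For reference, the trace the paper studies can be Monte Carlo estimated for a classifier as $\mathrm{Tr}(F)=\mathbb{E}_{x}\,\mathbb{E}_{y\sim p_\theta(\cdot|x)}\|\nabla_\theta \log p_\theta(y|x)\|^2$. The PyTorch sketch below is not the paper's code; the per-example loop is a simple but slow way to obtain the per-sample gradients the estimator needs.

```python
import torch
import torch.nn.functional as F

def fisher_trace(model, inputs):
    """Monte Carlo estimate of Tr(FIM): average squared gradient norm of
    -log p(y|x), with labels y sampled from the model's own predictions."""
    params = [p for p in model.parameters() if p.requires_grad]
    total = 0.0
    for x in inputs:                      # per-example gradients
        logits = model(x.unsqueeze(0))
        with torch.no_grad():             # sample y ~ p_theta(. | x)
            y = torch.distributions.Categorical(logits=logits).sample()
        nll = F.cross_entropy(logits, y)  # = -log p_theta(y | x)
        grads = torch.autograd.grad(nll, params)
        total += sum((g * g).sum().item() for g in grads)
    return total / len(inputs)
```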
- Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions [11.70706646606773]
We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks.
We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width.
We show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error.
arXiv Detail & Related papers (2020-06-27T22:13:20Z)
- Implicitly Maximizing Margins with the Hinge Loss [0.0]
We show that for a linear classifier on linearly separable data with a fixed step size, the margin of this modified hinge loss converges to the $\ell_2$ max-margin at a rate of $\mathcal{O}(1/t)$.
Empirical results suggest that this increased speed carries over to ReLU networks.
arXiv Detail & Related papers (2020-06-25T10:04:16Z)
- Do We Need Zero Training Loss After Achieving Zero Training Error? [76.44358201918156]
We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss (a one-line sketch follows this entry).
arXiv Detail & Related papers (2020-02-20T12:50:49Z)
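The flooding objective described above takes the form $|J(\theta) - b| + b$ and fits in one line; in the sketch below, the flood level `b` is a tuned hyperparameter and the value shown is only illustrative.

```python
def flooded_loss(loss, b=0.05):
    """Flooding: same value and gradient while loss > b; the gradient
    direction flips once loss < b, so training floats around level b."""
    return abs(loss - b) + b
```

Because `abs` works on both Python floats and PyTorch tensors, this wraps an existing training loss without any other changes to the training loop.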
- Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width instead of exponential, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)