Loss Spike in Training Neural Networks
- URL: http://arxiv.org/abs/2305.12133v2
- Date: Sat, 05 Oct 2024 05:40:02 GMT
- Title: Loss Spike in Training Neural Networks
- Authors: Xiaolong Li, Zhi-Qin John Xu, Zhongwang Zhang
- Abstract summary: We investigate the mechanism underlying loss spikes observed during neural network training.
From a frequency perspective, we explain the rapid descent in loss as being primarily influenced by low-frequency components.
We experimentally observe that loss spikes can facilitate condensation, causing input weights to evolve in the same direction.
- Score: 9.848777377317901
- License:
- Abstract: In this work, we investigate the mechanism underlying loss spikes observed during neural network training. When the training enters a region with a lower-loss-as-sharper (LLAS) structure, the training becomes unstable, and the loss grows exponentially once the loss landscape is too sharp, resulting in the rapid ascent of the loss spike. The training stabilizes when it finds a flat region. From a frequency perspective, we explain the rapid descent in loss as being primarily influenced by low-frequency components. We observe a deviation in the first eigendirection, which can be reasonably explained by the frequency principle, as low-frequency information is captured rapidly, leading to the rapid descent. Inspired by our analysis of loss spikes, we revisit the link between the maximum eigenvalue of the loss Hessian ($\lambda_{\mathrm{max}}$), flatness, and generalization. We suggest that $\lambda_{\mathrm{max}}$ is a good measure of sharpness but not a good measure of generalization. Furthermore, we experimentally observe that loss spikes can facilitate condensation, causing input weights to evolve in the same direction, and our experiments show a correlation (similar trend) between $\lambda_{\mathrm{max}}$ and condensation. This observation may provide valuable insights for further theoretical research on the relationship between loss spikes, $\lambda_{\mathrm{max}}$, and generalization.
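Since the abstract leans on the largest Hessian eigenvalue $\lambda_{\mathrm{max}}$ as its sharpness measure, a minimal sketch of how this quantity is commonly estimated may help: power iteration on Hessian-vector products via double backpropagation. This is a generic estimator, not code from the paper; `loss` and `params` are assumed to be any differentiable PyTorch loss and parameter list.

```python
import torch

def estimate_lambda_max(loss, params, iters=50, tol=1e-6):
    """Estimate the largest eigenvalue of the Hessian of `loss` w.r.t.
    `params` by power iteration on Hessian-vector products. Power
    iteration targets the eigenvalue of largest magnitude, which equals
    lambda_max whenever the top eigenvalue is positive (sharp regions)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]      # random start direction
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]                # keep ||v|| = 1
        # Hessian-vector product: differentiate (grad . v) w.r.t. params
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        new_eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
        converged = abs(new_eig - eig) <= tol * max(abs(eig), 1.0)
        eig = new_eig
        if converged:
            break
        v = [h.detach() for h in hv]               # next power-iteration step
    return eig
```

Here `params` would typically be `[p for p in model.parameters() if p.requires_grad]` and `loss` the training loss at the current iterate; tracking this estimate across training is how the sharpness trends above would be measured.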
Related papers
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective [66.80315289020487]
The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget.
We show that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom.
Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch (a minimal schedule sketch follows this entry).
arXiv Detail & Related papers (2024-10-07T16:49:39Z)
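For readers unfamiliar with the schedule, a minimal sketch of a WSD learning-rate function follows; the peak rate and phase lengths are assumed placeholders rather than values from the paper, and the WSD-S checkpoint-reuse mechanism is not reproduced.

```python
def wsd_lr(step, peak_lr=3e-4, warmup_steps=1_000, decay_start=None, decay_steps=2_000):
    """Warmup-Stable-Decay: linear warmup to peak_lr, a constant 'stable'
    plateau (the main branch, which can continue indefinitely), and a
    linear decay branch forked at `decay_start` when a checkpoint is wanted."""
    if step < warmup_steps:                          # warmup phase
        return peak_lr * step / warmup_steps
    if decay_start is None or step < decay_start:    # stable (main-branch) phase
        return peak_lr
    frac = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr * (1.0 - frac)                    # decay phase
```

Because the plateau never commits to an end step, one main branch can serve any number of compute budgets, which is the "no pre-specified compute budget" property described above.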
- Astral: training physics-informed neural networks with error majorants [45.24347017854392]
We argue that the residual is, at best, an indirect measure of the error of the approximate solution.
Since the error majorant provides a direct upper bound on the error, one can reliably estimate how close the PiNN is to the exact solution.
arXiv Detail & Related papers (2024-06-04T13:11:49Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization (a minimal sketch of the loss follows this entry).
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
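For reference, the unhinged loss as defined in the label-noise literature (van Rooyen et al., 2015) is linear in the margin; assuming the paper builds on this standard form, the sketch below shows why its dynamics are analytically friendly: the gradient with respect to the network output does not depend on the prediction.

```python
import torch

def unhinged_loss(output, target):
    """Unhinged loss: mean of 1 - y * f(x) over examples, for labels
    y in {-1, +1}. Linear in the output, so d(loss)/d(output) is the
    constant -y / n regardless of the predictions -- the property that
    makes closed-form training dynamics tractable."""
    return (1.0 - target * output).mean()
```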
- Optimal learning rate schedules in high-dimensional non-convex optimization problems [14.058580956992051]
Learning rate schedules are ubiquitously used to speed up and improve optimisation.
We present a first analytical study of the role of learning rate scheduling in this setting.
arXiv Detail & Related papers (2022-02-09T15:15:39Z)
- A variance principle explains why dropout finds flatter minima [0.0]
We show that training with dropout finds a neural network with a flatter minimum compared with standard gradient descent training.
We propose a Variance Principle: the variance of the noise is larger along the sharper directions of the loss landscape.
arXiv Detail & Related papers (2021-11-01T15:26:19Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization [111.57403811375484]
We show that gradient descent implicitly penalizes the trace of the Fisher Information Matrix from the beginning of training.
We highlight that, in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training (a sketch of a standard trace estimator follows this entry).
arXiv Detail & Related papers (2020-12-28T11:17:46Z)
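A minimal sketch of a standard Monte-Carlo estimator for the FIM trace (the expected squared gradient norm of the model log-likelihood); the paper's exact measurement protocol may differ, and `model` and `inputs` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def fisher_trace(model, inputs, label_samples=1):
    """Monte-Carlo estimate of Tr(FIM) for a classifier: average squared
    gradient norm of log p(y|x), with labels y sampled from the model's
    own predictive distribution."""
    params = [p for p in model.parameters() if p.requires_grad]
    total = 0.0
    for x in inputs:                                 # one example at a time
        logits = model(x.unsqueeze(0))
        for _ in range(label_samples):
            y = torch.distributions.Categorical(logits=logits).sample().item()
            logp = F.log_softmax(logits, dim=-1)[0, y]
            grads = torch.autograd.grad(logp, params, retain_graph=True)
            total += sum((g ** 2).sum().item() for g in grads)
    return total / (len(inputs) * label_samples)
```

An early-training explosion of this quantity is the "catastrophic Fisher explosion" the title refers to.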
- Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions [11.70706646606773]
We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks.
We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width.
We show that, under the same conditions, gradient descent dynamics on the empirical loss converge and lead to a small generalization error (a toy version of this setup follows this entry).
arXiv Detail & Related papers (2020-06-27T22:13:20Z)
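A toy version of this setting, under assumed sizes and learning rate (not the paper's configuration): a narrow quadratic-activation teacher generates the labels, and a wider student with the same architecture is trained by gradient descent on the empirical loss.

```python
import torch

def net(x, W, a):
    """One-hidden-layer network with quadratic activation:
    f(x) = sum_k a_k * (w_k . x)^2."""
    return ((x @ W.T) ** 2) @ a

d, m_teacher, m_student, n = 20, 5, 50, 1000         # assumed toy sizes
torch.manual_seed(0)
W_teacher = torch.randn(m_teacher, d) / d ** 0.5     # narrower teacher, fixed
a_teacher = torch.ones(m_teacher)
X = torch.randn(n, d)
y = net(X, W_teacher, a_teacher)                     # teacher-generated labels

W_student = (torch.randn(m_student, d) / d ** 0.5).requires_grad_()
a_student = torch.ones(m_student) / m_student        # second layer held fixed
opt = torch.optim.SGD([W_student], lr=1e-2)
for step in range(2_000):                            # GD on the empirical loss
    opt.zero_grad()
    loss = ((net(X, W_student, a_student) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final empirical loss: {loss.item():.4f}")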
- Implicitly Maximizing Margins with the Hinge Loss [0.0]
We show that, for a linear classifier on linearly separable data with a fixed step size, the margin of this modified hinge loss converges to the $\ell_2$ max-margin at a rate of $\mathcal{O}(1/t)$.
Empirical results suggest that this increased speed carries over to ReLU networks.
arXiv Detail & Related papers (2020-06-25T10:04:16Z)
- Do We Need Zero Training Loss After Achieving Zero Training Error? [76.44358201918156]
We propose a direct solution called flooding that intentionally prevents further reduction of the training loss once it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double-descent curve of the test loss (a one-line sketch of the objective follows this entry).
arXiv Detail & Related papers (2020-02-20T12:50:49Z)
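The flooding objective itself is a one-line transform of the training loss; a minimal sketch, with the flood level `b` treated as an assumed hyperparameter (it is tuned in the paper).

```python
def flooded(loss, b=0.05):
    """Flooding: J_flood = |J - b| + b. Above the flood level b the
    gradient matches the original loss; below it the gradient sign flips,
    so training floats around loss level b instead of pushing the
    training loss to zero."""
    return (loss - b).abs() + b
```

In practice it would replace the plain objective, e.g. `flooded(criterion(model(x), y)).backward()`.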
- Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to a low robust training loss for polynomial width instead of exponential width, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.