The Epochal Sawtooth Phenomenon: Unveiling Training Loss Oscillations in Adam and Other Optimizers
- URL: http://arxiv.org/abs/2410.10056v3
- Date: Wed, 18 Jun 2025 01:31:52 GMT
- Title: The Epochal Sawtooth Phenomenon: Unveiling Training Loss Oscillations in Adam and Other Optimizers
- Authors: Qi Liu, Wanjing Ma
- Abstract summary: We identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Phenomenon (ESP). This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve.
- Score: 8.770864706004472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we identify and analyze a recurring training loss pattern, which we term the \textit{Epochal Sawtooth Phenomenon (ESP)}, commonly observed during training with adaptive gradient-based optimizers, particularly the Adam optimizer. This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve. Through empirical observations, we demonstrate that while this effect is most pronounced with Adam, it persists, although less severely, with other optimizers such as RMSProp. We empirically analyze the mechanisms underlying ESP, focusing on key factors such as Adam's $\beta$ parameters, batch size, data shuffling, and sample replacement. Our analysis shows that ESP arises from adaptive learning rate adjustments controlled by the second moment estimate. Additionally, we identify the ``immediate re-exposure to samples'' effect during data shuffling, which causes the model to learn or memorize more at the beginning of each epoch. We also find that smaller values of $\beta_2$ exacerbate ESP but can act as a form of regularization. While ESP is not necessarily indicative of overfitting, higher model capacity can amplify the phenomenon. To further support our analysis, we replicate ESP through a high-dimensional quadratic minimization task. We demonstrate that ESP can emerge even in simple optimization scenarios, reinforcing the generality of this pattern. The code for reproducing our experiments is available at https://github.com/qiliuchn/training-loss-pattern.
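The abstract's claim that ESP can be reproduced on a plain quadratic objective lends itself to a small self-contained experiment. The sketch below is not the authors' released code (that lives at https://github.com/qiliuchn/training-loss-pattern); it is a minimal NumPy reconstruction under assumed settings (problem size, batch size, learning rate, and beta values are illustrative), running a hand-rolled Adam update on minibatches of a least-squares objective with a fresh shuffle at every epoch.

```python
# Minimal sketch of the paper's quadratic-minimization setup (assumed settings,
# not the authors' code): minibatch Adam on f(x) = 0.5 * ||A x - b||^2 with the
# data reshuffled each epoch. If ESP appears, the first minibatch loss of an
# epoch is noticeably lower than the last one of that epoch.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 256                       # assumed number of samples / dimension
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

x = np.zeros(d)                        # parameters
m = np.zeros(d)                        # Adam first-moment estimate
v = np.zeros(d)                        # Adam second-moment estimate
lr, beta1, beta2, eps = 1e-2, 0.9, 0.99, 1e-8   # smaller beta2 should strengthen ESP
batch, t = 64, 0
losses = []

for epoch in range(20):
    order = rng.permutation(n)         # fresh shuffle -> "immediate re-exposure to samples"
    for start in range(0, n, batch):
        idx = order[start:start + batch]
        resid = A[idx] @ x - b[idx]
        g = A[idx].T @ resid / len(idx)          # minibatch gradient
        t += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment drives the adaptive step size
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
        losses.append(0.5 * np.mean((A[idx] @ x - b[idx]) ** 2))
    steps = n // batch
    print(f"epoch {epoch:2d}: first-batch loss {losses[epoch * steps]:.4f} "
          f"-> last-batch loss {losses[-1]:.4f}")
```

Plotting `losses` (or comparing the printed first/last minibatch losses per epoch) is one way to look for the sawtooth shape; per the abstract, lowering `beta2` should make the pattern more pronounced.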
Related papers
- Post-Hoc Reversal: Are We Selecting Models Prematurely? [13.910702424593797]
We show a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying post-hoc transforms.
Preliminary analyses suggest that these transforms induce reversal by suppressing the influence of mislabeled examples.
We propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions.
arXiv Detail & Related papers (2024-04-11T14:58:19Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Inference and Interference: The Role of Clipping, Pruning and Loss Landscapes in Differentially Private Stochastic Gradient Descent [13.27004430044574]
Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks.
We compare the behavior of the two processes separately in early and late epochs.
We find that while DP-SGD makes slower progress in early stages, it is the behavior in the later stages that determines the end result.
arXiv Detail & Related papers (2023-11-12T13:31:35Z) - Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z) - Spectral Evolution and Invariance in Linear-width Neural Networks [8.419660614226816]
We investigate the spectral properties of linear-width feed-forward neural networks.
We show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent for small constant learning rates.
We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both the weight and kernel matrices exhibit heavy-tailed behavior.
arXiv Detail & Related papers (2022-11-11T23:00:30Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - When and how epochwise double descent happens [7.512375012141203]
An 'epochwise double descent' effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
arXiv Detail & Related papers (2021-08-26T19:19:17Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels [1.4424394176890545]
This work confirms that double descent occurs with small datasets and noisy labels.
We show that increasing the learning rate can create an aliasing effect that masks the double descent pattern without suppressing it.
We show that these findings translate to a real-world application: the forecast of events in epileptic patients from continuous electroencephalographic recordings.
arXiv Detail & Related papers (2021-06-03T19:41:40Z) - Reweighting Augmented Samples by Minimizing the Maximal Expected Loss [51.2791895511333]
We construct the maximal expected loss which is the supremum over any reweighted loss on augmented samples.
Inspired by adversarial training, we minimize this maximal expected loss and obtain a simple and interpretable closed-form solution.
The proposed method can generally be applied on top of any data augmentation methods.
arXiv Detail & Related papers (2021-03-16T09:31:04Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but its role in this success is still unclear.
We show that heavy-tailed behavior commonly arises in the parameters due to multiplicative noise driven by minibatch variance.
A detailed analysis is conducted of key factors, including step size and data, with similar results observed on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - The Implicit and Explicit Regularization Effects of Dropout [43.431343291010734]
Dropout is a widely-used regularization technique, often required to obtain state-of-the-art performance for a number of architectures.
This work demonstrates that dropout introduces two distinct but entangled regularization effects.
arXiv Detail & Related papers (2020-02-28T18:31:17Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)