Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
- URL: http://arxiv.org/abs/2508.01483v1
- Date: Sat, 02 Aug 2025 20:36:52 GMT
- Title: Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
- Authors: Aleksandr Dremov, Alexander Hägele, Atli Kosson, Martin Jaggi
- Abstract summary: We provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis shows that different cooldown shapes expose a fundamental bias-variance trade-off in the resulting models. We also provide visualizations of the loss landscape during cooldown, empirically supporting the river valley loss perspective.
- Score: 106.59372118904957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis reveals that different cooldown shapes reveal a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations – comparable to those from cooldown shape selection – when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
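Since the abstract's findings center on how the cooldown is shaped, a minimal sketch of a WSD schedule with a pluggable cooldown shape may be useful. The function name, the fraction defaults, and the specific shapes (linear, cosine, 1-sqrt) are illustrative assumptions; the abstract does not specify the paper's exact configurations.

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, cooldown_frac=0.2,
           shape="1-sqrt"):
    """Warmup-Stable-Decay learning rate: linear warmup, constant plateau,
    then a cooldown whose shape is configurable (illustrative sketch)."""
    warmup_steps = int(warmup_frac * total_steps)
    cooldown_steps = int(cooldown_frac * total_steps)
    stable_end = total_steps - cooldown_steps

    if step < warmup_steps:              # warmup: ramp 0 -> peak
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                # stable: constant plateau
        return peak_lr
    # cooldown: peak -> 0; the shape governs the exploration/exploitation balance
    t = (step - stable_end) / max(cooldown_steps, 1)   # progress in [0, 1]
    if shape == "linear":
        return peak_lr * (1.0 - t)
    if shape == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
    if shape == "1-sqrt":                # drops quickly at first, then flattens
        return peak_lr * (1.0 - math.sqrt(t))
    raise ValueError(f"unknown cooldown shape: {shape}")
```

The abstract's AdamW observation would correspond to raising $\beta_2$ for the cooldown steps (for example from 0.95 toward 0.999); the exact values are not given in the abstract.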
Related papers
- An Adaptive Volatility-based Learning Rate Scheduler [0.0]
VolSched is a novel LR scheduler inspired by the concept of volatility in stochastic processes such as Geometric Brownian Motion. By calculating the ratio between long-term and short-term accuracy volatility, VolSched increases the LR to escape plateaus and decreases it to stabilize training.
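The summary above specifies only the volatility-ratio idea; the window lengths, the multiplicative update, and the clamping in this sketch are assumptions added to make it concrete.

```python
from collections import deque
import statistics

class VolSchedSketch:
    """Hypothetical volatility-ratio LR rule in the spirit of VolSched."""
    def __init__(self, base_lr, short_win=10, long_win=100,
                 lr_min=1e-5, lr_max=1.0):
        self.lr = base_lr
        self.lr_min, self.lr_max = lr_min, lr_max
        self.short = deque(maxlen=short_win)   # short-term accuracy history
        self.long = deque(maxlen=long_win)     # long-term accuracy history

    def step(self, accuracy):
        self.short.append(accuracy)
        self.long.append(accuracy)
        if len(self.long) < self.long.maxlen:
            return self.lr                     # not enough history yet
        vol_short = statistics.pstdev(self.short) + 1e-12
        vol_long = statistics.pstdev(self.long) + 1e-12
        # ratio > 1: short-term accuracy is flat relative to its long-term
        # swings (a plateau), so raise the LR; ratio < 1 lowers it.
        self.lr = min(max(self.lr * vol_long / vol_short, self.lr_min), self.lr_max)
        return self.lr
```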
arXiv Detail & Related papers (2025-07-11T05:45:53Z) - The Epochal Sawtooth Phenomenon: Unveiling Training Loss Oscillations in Adam and Other Optimizers [8.770864706004472]
We identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Phenomenon (ESP). This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve.
arXiv Detail & Related papers (2024-10-14T00:51:21Z) - Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective [66.80315289020487]
The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget. We show that the pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Inspired by this theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, as sketched below.
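A single-branch schedule of this kind can be sketched as follows: the run stays at the plateau LR, decays into each requested checkpoint, and then warms back up from the decayed state instead of branching. The linear decay shape and the re-warmup length are illustrative assumptions, not the paper's exact recipe.

```python
def wsd_s_lrs(total_steps, peak_lr, checkpoint_steps, decay_len, rewarm_len):
    """Per-step LRs for a sketch of a WSD-S-style single-branch schedule."""
    lrs = [peak_lr] * total_steps
    for ckpt in checkpoint_steps:
        for i in range(decay_len):             # linear decay into the checkpoint
            s = ckpt - decay_len + i
            if 0 <= s < total_steps:
                lrs[s] = peak_lr * (1.0 - (i + 1) / decay_len)
        for i in range(rewarm_len):            # resume the same branch: re-warm
            s = ckpt + i
            if 0 <= s < total_steps:
                lrs[s] = peak_lr * (i + 1) / rewarm_len
    return lrs

# e.g. two intermediate checkpoints in a 10k-step run:
# lrs = wsd_s_lrs(10_000, 3e-4, checkpoint_steps=[4_000, 8_000],
#                 decay_len=800, rewarm_len=200)
```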
arXiv Detail & Related papers (2024-10-07T16:49:39Z) - Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach [11.878350833222711]
We propose a method called GradSamp for sampling gradient updates from a Gaussian distribution.
GradSamp not only streamlines gradient computation but also enables skipping entire epochs, thereby enhancing overall efficiency.
We rigorously validate our hypothesis across a diverse set of standard and non-standard CNN and transformer-based models.
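The summary specifies only that updates are sampled from a Gaussian; the per-parameter fitting, the (mean, std) bookkeeping, and the decision of when to skip a step are assumptions in this sketch.

```python
import torch

def fit_update_stats(update_history):
    """Fit a per-parameter Gaussian to a list of recent update dicts
    (name -> update tensor); needs at least two recorded updates."""
    stats = {}
    for name in update_history[0]:
        stacked = torch.stack([u[name] for u in update_history])
        stats[name] = (stacked.mean(dim=0), stacked.std(dim=0) + 1e-8)
    return stats

@torch.no_grad()
def sampled_update_step(model, stats):
    """A 'skipped' step: apply updates drawn from the fitted Gaussians
    instead of running a forward/backward pass (hypothetical sketch)."""
    for name, p in model.named_parameters():
        mean, std = stats[name]
        p.add_(torch.normal(mean, std))
```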
arXiv Detail & Related papers (2024-06-11T15:01:20Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
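$\sigma$Reparam is the reparameterization $\hat{W} = (\gamma / \sigma(W))\,W$, with $\sigma(W)$ the spectral norm of $W$ and $\gamma$ learnable. A sketch for a linear layer, with the spectral norm tracked by one power-iteration step per training forward, is below; the initialization choices are common-practice assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Linear layer with sigma-Reparam-style weights: (gamma / sigma(W)) * W."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * in_features ** -0.5)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.gamma = nn.Parameter(torch.ones(()))   # learnable spectral scale
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(in_features), dim=0))

    def forward(self, x):
        if self.training:
            with torch.no_grad():                   # one power-iteration step
                self.v = F.normalize(self.weight.t() @ self.u, dim=0)
                self.u = F.normalize(self.weight @ self.v, dim=0)
        sigma = torch.dot(self.u, self.weight @ self.v)  # spectral-norm estimate
        return F.linear(x, (self.gamma / sigma) * self.weight, self.bias)
```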
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Leveraging Predictions in Smoothed Online Convex Optimization via Gradient-based Algorithms [18.64335888217192]
We consider online convex optimization with time-varying stage costs and additional switching costs.
Since the switching costs introduce coupling across all stages, long-term predictions tend to suffer from lower quality.
We introduce a gradient-based online algorithm, Receding Horizon Inexact Gradient (RHIG), and analyze its performance by dynamic regrets.
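The stub leaves the algorithm's details out; the sketch below is only a simplified stand-in for RHIG that keeps its two ingredients, predicted stage costs over a receding horizon and inexact (single) gradient steps, with the quadratic switching penalty pulling each decision toward the previous one.

```python
import numpy as np

def receding_horizon_step(x_prev, pred_grads, beta=1.0, eta=0.1):
    """One stage decision: sweep the horizon of predicted stage-cost
    gradients, taking an inexact gradient step for each, regularized by
    the switching cost (beta/2) * ||x - x_prev||^2 (illustrative sketch)."""
    x = x_prev.copy()
    for grad in pred_grads:                     # pred_grads: list of callables
        x = x - eta * (grad(x) + beta * (x - x_prev))
    return x

# e.g. quadratic stage costs f_t(x) = ||x - target_t||^2:
# targets = [np.ones(2), 2 * np.ones(2)]
# x1 = receding_horizon_step(np.zeros(2),
#                            [lambda x, c=c: 2 * (x - c) for c in targets])
```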
arXiv Detail & Related papers (2020-11-25T06:25:51Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
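The stub does not spell out the update rules it unifies; the classic extragradient step, evaluating the gradient at a lookahead point and applying it at the original point, is one concrete instance and is sketched here for illustration.

```python
import torch

def extragradient_step(params, loss_fn, lr, gamma=None):
    """One extragradient-style update: extrapolate, re-evaluate, apply."""
    gamma = lr if gamma is None else gamma
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        saved = [p.detach().clone() for p in params]
        for p, g in zip(params, grads):
            p -= gamma * g                      # move to the lookahead point
    lookahead_grads = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():
        for p, s, g in zip(params, saved, lookahead_grads):
            p.copy_(s - lr * g)                 # update from the original point
    return loss.item()

# e.g. x = torch.nn.Parameter(torch.zeros(3))
# extragradient_step([x], lambda: ((x - 1.0) ** 2).sum(), lr=0.1)
```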
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin (adaptive model initialization) to stabilize training in the early stage and unleash the model's full potential in the late stage.
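Admin's mechanism is a rescaled residual connection, roughly x <- LN(omega * x + f(x)), with omega initialized from output-variance statistics gathered in a profiling pass; the constant initialization below is a placeholder assumption.

```python
import torch
import torch.nn as nn

class AdminResidual(nn.Module):
    """Admin-style residual wrapper: scale the skip branch by a learnable
    omega before layer normalization (sketch; init is a placeholder)."""
    def __init__(self, dim, sublayer, omega_init=1.0):
        super().__init__()
        self.sublayer = sublayer
        self.omega = nn.Parameter(torch.full((dim,), omega_init))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(self.omega * x + self.sublayer(x))

# e.g. wrapping a feed-forward block of width 512:
# block = AdminResidual(512, nn.Sequential(
#     nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)))
```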
arXiv Detail & Related papers (2020-04-17T13:59:07Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)