When to restart? Exploring escalating restarts on convergence
- URL: http://arxiv.org/abs/2603.04117v1
- Date: Wed, 04 Mar 2026 14:35:27 GMT
- Title: When to restart? Exploring escalating restarts on convergence
- Authors: Ayush K. Varshney, Šarūnas Girdzijauskas, Konstantinos Vandikas, Aneta Vulgarakis Feljan
- Abstract summary: We propose a simple yet effective strategy, which we call Stochastic Gradient Descent with Escalating Restarts (SGD-ER). Our method monitors training progress and triggers restarts when stagnation is detected, linearly escalating the learning rate to escape sharp local minima. Compared to standard schedulers, SGD-ER improves test accuracy by 0.5-4.5%, demonstrating the benefit of convergence-aware escalating restarts for better local optima.
- Score: 0.06524460254566904
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning rate scheduling plays a critical role in the optimization of deep neural networks, directly influencing convergence speed, stability, and generalization. While existing schedulers such as cosine annealing, cyclical learning rates, and warm restarts have shown promise, they often rely on fixed or periodic triggers that are agnostic to the training dynamics, such as stagnation or convergence behavior. In this work, we propose a simple yet effective strategy, which we call Stochastic Gradient Descent with Escalating Restarts (SGD-ER). It adaptively increases the learning rate upon convergence. Our method monitors training progress and triggers restarts when stagnation is detected, linearly escalating the learning rate to escape sharp local minima and explore flatter regions of the loss landscape. We evaluate SGD-ER across CIFAR-10, CIFAR-100, and TinyImageNet on a range of architectures including ResNet-18/34/50, VGG-16, and DenseNet-101. Compared to standard schedulers, SGD-ER improves test accuracy by 0.5-4.5%, demonstrating the benefit of convergence-aware escalating restarts for better local optima.
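The abstract describes the mechanism at a high level: monitor progress, detect stagnation, and restart with a linearly escalated learning rate. A minimal sketch of such a convergence-aware escalating-restart schedule is shown below; all class/parameter names and thresholds (patience, min_delta, the decay between restarts) are illustrative assumptions, not the authors' implementation.

```python
class EscalatingRestartScheduler:
    """Sketch of a convergence-aware escalating-restart LR schedule.
    Hypothetical names and defaults; not the paper's SGD-ER code."""

    def __init__(self, base_lr=0.1, escalation=0.05, patience=5, min_delta=1e-4):
        self.lr = base_lr
        self.base_lr = base_lr
        self.escalation = escalation   # linear increment added to each restart's peak LR
        self.patience = patience       # epochs without improvement before a restart fires
        self.min_delta = min_delta     # minimum loss decrease that counts as progress
        self.best_loss = float("inf")
        self.stall = 0
        self.restarts = 0

    def step(self, val_loss, decay=0.9):
        # Between restarts, decay the LR (exponential decay chosen for illustration).
        self.lr *= decay
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.stall = 0
        else:
            self.stall += 1
        if self.stall >= self.patience:
            # Stagnation detected: restart with a linearly escalated learning rate.
            self.restarts += 1
            self.lr = self.base_lr + self.restarts * self.escalation
            self.stall = 0
        return self.lr
```

Called once per epoch with the validation loss, the schedule decays normally while the loss improves; after `patience` stagnant epochs it jumps the learning rate above the previous peak, matching the "escalating" behavior the abstract describes.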
Related papers
- Synchrony-Gated Plasticity with Dopamine Modulation for Spiking Neural Networks [6.085945372100414]
Dopamine-Modulated Spike-Synchrony-Dependent Plasticity (DA-SSDP) is a synchrony-based rule that is sensitive to loss. DA-SSDP condenses spike patterns into a synchrony metric at the batch level.
arXiv Detail & Related papers (2025-12-08T06:10:44Z) - Plug-and-Play Homeostatic Spark: Zero-Cost Acceleration for SNN Training Across Paradigms [40.57310813106791]
Spiking neural networks offer event-driven computation, sparse activation, and hardware efficiency, yet training often converges slowly and lacks stability. We present Adaptive Homeostatic Spiking Activity Regulation (AHSAR), an extremely simple plug-and-play method that works across training paradigms. AHSAR stabilizes optimization and accelerates convergence without changing the model architecture, loss, or gradients.
arXiv Detail & Related papers (2025-12-04T17:26:46Z) - Training in reverse: How iteration order influences convergence and stability in deep learning [7.702226188904773]
Training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. We show that the order of gradient updates affects stability and convergence in gradient-based iterations. We illustrate this new line of thinking using backward-SGD, which processes batch gradient updates like SGD but in reverse order.
arXiv Detail & Related papers (2025-02-03T17:40:03Z) - Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice [7.494722456816369]
We propose a probabilistic learning rate scheduler (PLRS). PLRS does not conform to the monotonically decreasing condition, yet retains provable convergence guarantees. We show that PLRS performs as well as or better than existing state-of-the-art learning rate schedulers in terms of convergence as well as accuracy.
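To make the "non-monotone yet convergent" idea concrete, here is a hedged sketch of what a probabilistic learning rate schedule could look like: the rate is sampled from a band whose upper edge decays over training, so individual values may go up or down while the sequence still shrinks overall. The function name, the uniform distribution, and the decay shape are all assumptions for illustration, not the PLRS algorithm from the paper.

```python
import random

def sampled_lr(epoch, total_epochs, lr_max=0.1, lr_min=1e-4, seed=None):
    """Hypothetical probabilistic LR schedule (illustration only, not PLRS).
    Samples uniformly from [lr_min, hi], where hi decays linearly to lr_min."""
    rng = random.Random(seed)
    frac = epoch / max(1, total_epochs)
    hi = lr_min + (lr_max - lr_min) * (1.0 - frac)  # decaying upper bound
    return rng.uniform(lr_min, hi)
```

Consecutive samples need not decrease, but every sample stays inside a shrinking envelope, which is one simple way a stochastic schedule can avoid strict monotonicity while still driving the learning rate down.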
arXiv Detail & Related papers (2024-07-10T12:52:24Z) - Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
arXiv Detail & Related papers (2023-12-01T05:38:17Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks. It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr [60.07531696857743]
Fine-tuning a deep convolutional neural network (CNN) using a pre-trained model helps transfer knowledge learned from larger datasets to the target task.
We propose RIFLE, a strategy that deepens backpropagation in transfer learning settings by periodically re-initializing the fully-connected layer.
RIFLE brings meaningful updates to the weights of deep CNN layers and improves low-level feature learning.
arXiv Detail & Related papers (2020-07-07T11:27:43Z) - Convolutional Neural Network Training with Distributed K-FAC [14.2773046188145]
Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix.
We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale.
arXiv Detail & Related papers (2020-07-01T22:00:53Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [48.99717153937717]
We present WAGMA-SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. We train ResNet-50 on ImageNet, a Transformer for machine translation, and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.