Flatter, faster: scaling momentum for optimal speedup of SGD
- URL: http://arxiv.org/abs/2210.16400v2
- Date: Tue, 13 Jun 2023 04:47:46 GMT
- Title: Flatter, faster: scaling momentum for optimal speedup of SGD
- Authors: Aditya Cowsik, Tankut Can and Paolo Glorioso
- Abstract summary: We study the training dynamics arising from the interplay between stochastic gradient descent (SGD) with label noise and momentum in the training of overparametrized neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonly used optimization algorithms often show a trade-off between good
generalization and fast training times. For instance, stochastic gradient
descent (SGD) tends to have good generalization; however, adaptive gradient
methods have superior training times. Momentum can help accelerate training
with SGD, but so far there has been no principled way to select the momentum
hyperparameter. Here we study training dynamics arising from the interplay
between SGD with label noise and momentum in the training of overparametrized
neural networks. We find that scaling the momentum hyperparameter $1-\beta$
with the learning rate to the power of $2/3$ maximally accelerates training,
without sacrificing generalization. To analytically derive this result we
develop an architecture-independent framework, where the main assumption is the
existence of a degenerate manifold of global minimizers, as is natural in
overparametrized models. Training dynamics display the emergence of two
characteristic timescales that are well-separated for generic values of the
hyperparameters. The maximum acceleration of training is reached when these two
timescales meet, which in turn determines the scaling limit we propose. We
confirm our scaling rule for synthetic regression problems (matrix sensing and
teacher-student paradigm) and classification for realistic datasets (ResNet-18
on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our
scaling rule to variations in architectures and datasets.
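As an illustration of the scaling rule stated in the abstract, the sketch below sets the SGD momentum so that $1-\beta$ is proportional to the learning rate to the power $2/3$. The prefactor `c`, the clipping bound, and the toy model are assumptions made for the example; the paper determines the optimal prefactor from the two characteristic timescales of the dynamics.

```python
# Minimal sketch (not the authors' code): pick the SGD momentum beta so that
# 1 - beta = c * lr**(2/3), following the scaling rule stated in the abstract.
import torch

def momentum_from_lr(lr: float, c: float = 1.0) -> float:
    """Return beta such that 1 - beta = c * lr**(2/3), clipped to [0, 0.999]."""
    beta = 1.0 - c * lr ** (2.0 / 3.0)
    return min(max(beta, 0.0), 0.999)

lr = 3e-3
beta = momentum_from_lr(lr)          # ~0.979 for lr = 3e-3 with c = 1
model = torch.nn.Linear(10, 1)       # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=beta)
```

With this coupling, lowering the learning rate automatically pushes $\beta$ toward 1, which is the regime where the paper reports the two training timescales meeting and the speedup being maximal.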
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\left( \ln(T) / T^{1 - \frac{1}{\alpha}} \right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Asymmetric Momentum: A Rethinking of Gradient Descent [4.1001738811512345]
We propose a simple enhancement of SGD, Loss-Controlled Asymmetric Momentum (LCAM).
By averaging the loss, we divide the training process into different loss phases and use a different momentum in each phase, as sketched below.
We experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset.
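A rough, hypothetical sketch of the loss-phase idea summarized above follows; the phase criterion (comparing the current loss to its running average) and the two momentum values are assumptions for illustration, not the exact rule from the LCAM paper.

```python
# Hypothetical sketch: switch the SGD momentum between two values depending on
# a loss "phase" inferred from a running average of the training loss.
import torch

def update_momentum(optimizer, loss, running_avg, beta_lo=0.5, beta_hi=0.95):
    # Assumed phase rule: use the larger momentum when the current loss is
    # below its running average, and the smaller one otherwise.
    beta = beta_hi if loss < running_avg else beta_lo
    for group in optimizer.param_groups:
        group["momentum"] = beta
    return beta

model = torch.nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
running_avg, step_loss = 1.0, 0.8                    # dummy loss statistics
update_momentum(optimizer, step_loss, running_avg)   # sets momentum to 0.95
```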
arXiv Detail & Related papers (2023-09-05T11:16:47Z) - The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z) - Scalable One-Pass Optimisation of High-Dimensional Weight-Update
Hyperparameters by Implicit Differentiation [0.0]
We develop an approximate hypergradient-based hyperparameter optimiser.
It requires only one training episode, with no restarts.
We also provide a motivating argument for convergence to the true hypergradient.
arXiv Detail & Related papers (2021-10-20T09:57:57Z) - Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of accuracy and also converges faster.
arXiv Detail & Related papers (2021-09-07T20:19:40Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
The proposed AdaMomentum optimizer performs well on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value, as sketched below.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
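The summary above describes adjusting per-layer scales so that one optimizer step yields the smallest loss. A simplified, hedged sketch of that idea is shown below; the greedy per-layer search over a few candidate scale factors is an assumption made for brevity and differs from the paper's own procedure.

```python
# Simplified sketch (not the GradInit algorithm itself): greedily rescale each
# Linear layer's weights by the candidate factor that minimizes the loss
# obtained after a single simulated SGD step on one batch.
import copy
import torch

def loss_after_one_step(model, loss_fn, batch, lr):
    trial = copy.deepcopy(model)
    x, y = batch
    loss = loss_fn(trial(x), y)
    grads = torch.autograd.grad(loss, list(trial.parameters()))
    with torch.no_grad():
        for p, g in zip(trial.parameters(), grads):
            p -= lr * g
        return loss_fn(trial(x), y).item()

def greedy_rescale(model, loss_fn, batch, lr=0.1, factors=(0.5, 1.0, 2.0)):
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            scores = {}
            for f in factors:
                with torch.no_grad():
                    module.weight *= f
                scores[f] = loss_after_one_step(model, loss_fn, batch, lr)
                with torch.no_grad():
                    module.weight /= f          # restore original scale
            best = min(scores, key=scores.get)  # factor with the lowest loss
            with torch.no_grad():
                module.weight *= best

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1))
batch = (torch.randn(32, 10), torch.randn(32, 1))
greedy_rescale(model, torch.nn.functional.mse_loss, batch)
```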
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum [9.843647947055745]
In deep learning practice, momentum is usually weighted by a well-calibrated constant.
We propose a novel adaptive momentum for improving DNN training.
arXiv Detail & Related papers (2020-12-03T18:59:43Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate
and Momentum [97.84312669132716]
We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.