Flatter, faster: scaling momentum for optimal speedup of SGD
- URL: http://arxiv.org/abs/2210.16400v2
- Date: Tue, 13 Jun 2023 04:47:46 GMT
- Title: Flatter, faster: scaling momentum for optimal speedup of SGD
- Authors: Aditya Cowsik, Tankut Can and Paolo Glorioso
- Abstract summary: We study training dynamics arising from the interplay between stochastic gradient descent (SGD) with label noise and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonly used optimization algorithms often show a trade-off between good
generalization and fast training times. For instance, stochastic gradient
descent (SGD) tends to have good generalization; however, adaptive gradient
methods have superior training times. Momentum can help accelerate training
with SGD, but so far there has been no principled way to select the momentum
hyperparameter. Here we study training dynamics arising from the interplay
between SGD with label noise and momentum in the training of overparametrized
neural networks. We find that scaling the momentum hyperparameter $1-\beta$
with the learning rate to the power of $2/3$ maximally accelerates training,
without sacrificing generalization. To analytically derive this result we
develop an architecture-independent framework, where the main assumption is the
existence of a degenerate manifold of global minimizers, as is natural in
overparametrized models. Training dynamics display the emergence of two
characteristic timescales that are well-separated for generic values of the
hyperparameters. The maximum acceleration of training is reached when these two
timescales meet, which in turn determines the scaling limit we propose. We
confirm our scaling rule for synthetic regression problems (matrix sensing and
teacher-student paradigm) and classification for realistic datasets (ResNet-18
on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our
scaling rule to variations in architectures and datasets.
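The abstract's scaling rule can be written as $1-\beta = c\,\eta^{2/3}$ for a learning rate $\eta$ and some constant $c$. Below is a minimal sketch (not the authors' code) of how one might apply this rule when configuring SGD with momentum in PyTorch; the helper `momentum_for_lr` and the constant `c` are illustrative assumptions, since the paper's claim concerns the scaling exponent rather than a specific proportionality constant.
```python
# Minimal sketch of the proposed scaling rule: keep 1 - beta proportional
# to lr**(2/3) when the learning rate changes. The constant `c` is a free
# choice here and would need tuning in practice.
import torch


def momentum_for_lr(lr: float, c: float = 1.0) -> float:
    """Return beta such that 1 - beta = c * lr ** (2/3), clipped so beta stays in [0, 1)."""
    one_minus_beta = min(c * lr ** (2.0 / 3.0), 1.0)
    return 1.0 - one_minus_beta


# Example usage with a hypothetical model and learning rate.
model = torch.nn.Linear(10, 1)
lr = 0.05
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=lr,
    momentum=momentum_for_lr(lr, c=0.5),
)
```
With this rule, halving the learning rate does not leave the momentum fixed at a conventional value (e.g. 0.9) but instead moves $\beta$ closer to 1 at the $2/3$ rate, which is the regime the paper identifies as maximally accelerating training.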
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Asymmetric Momentum: A Rethinking of Gradient Descent [4.1001738811512345]
We propose a simple SGD enhancement, Loss-Controlled Asymmetric Momentum (LCAM).
By averaging the loss, we divide the training process into different loss phases and apply a different momentum in each.
We experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset.
arXiv Detail & Related papers (2023-09-05T11:16:47Z) - The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z) - Scalable One-Pass Optimisation of High-Dimensional Weight-Update
Hyperparameters by Implicit Differentiation [0.0]
We develop an approximate hypergradient-based hyperparameter optimiser.
It requires only one training episode, with no restarts.
We also provide a motivating argument for convergence to the true hypergradient.
arXiv Detail & Related papers (2021-10-20T09:57:57Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum [9.843647947055745]
In deep learning practice, momentum is usually weighted by a well-calibrated constant.
We propose a novel adaptive momentum for improving the training of DNNs.
arXiv Detail & Related papers (2020-12-03T18:59:43Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate
and Momentum [97.84312669132716]
We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.