Robust Training of Neural Networks using Scale Invariant Architectures
- URL: http://arxiv.org/abs/2202.00980v1
- Date: Wed, 2 Feb 2022 11:58:56 GMT
- Title: Robust Training of Neural Networks using Scale Invariant Architectures
- Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi,
Sanjiv Kumar
- Abstract summary: In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
- Score: 70.67803417918854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast to SGD, adaptive gradient methods like Adam allow robust training
of modern deep networks, especially large language models. However, the use of
adaptivity not only comes at the cost of extra memory but also raises the
fundamental question: can non-adaptive methods like SGD enjoy similar benefits?
In this paper, we provide an affirmative answer to this question by proposing
to achieve both robust and memory-efficient training via the following general
recipe: (1) modify the architecture and make it scale invariant, i.e. the scale
of the parameters does not affect the output of the network, (2) train with SGD and
weight decay, and optionally (3) clip the global gradient norm proportional to
weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is
learning rate and $\lambda$ is weight decay. We show that this general approach
is robust to rescaling of parameters and loss by proving that its convergence
depends only logarithmically on the scale of initialization and loss, whereas
standard SGD might not even converge for many initializations. Following
our recipe, we design a scale invariant version of BERT, called SIBERT, which
when trained simply by vanilla SGD achieves performance comparable to BERT
trained by adaptive methods like Adam on downstream tasks.
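Below is a minimal PyTorch-style sketch of steps (2) and (3) of the recipe in the abstract. It assumes a scale-invariant `model` from step (1) and a `loss_fn(model, batch)` helper; the hyperparameter values and names are illustrative placeholders, not the paper's exact training setup.

```python
import math
import torch

ETA = 0.1    # learning rate (eta); illustrative value
LAM = 1e-4   # weight decay (lambda); illustrative value

def sgd_recipe_step(model, loss_fn, batch):
    """One training step following steps (2) and (3) of the recipe."""
    loss = loss_fn(model, batch)
    model.zero_grad()
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]

    # Optional step (3): clip the *global* gradient norm at
    # ||w|| * sqrt(2*lambda/eta), where ||w|| is the global weight norm.
    weight_norm = torch.norm(torch.stack([p.detach().norm() for p in params]))
    max_norm = float(weight_norm) * math.sqrt(2 * LAM / ETA)
    torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)

    # Step (2): vanilla SGD with weight decay.
    with torch.no_grad():
        for p in params:
            p.add_(p.grad + LAM * p, alpha=-ETA)
    return loss.item()
```

Because the clipping threshold scales with the current weight norm, rescaling the parameters rescales the threshold with them, which is consistent with the scale-robustness claim above.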
Related papers
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
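For context on the entry above, here is a small NumPy sketch of plain column-row sampling (CRS), the classical unbiased estimator of a matrix product built from sampled column/row pairs. The paper's WTA-CRS estimator changes how those pairs are selected to reduce variance further; that selection rule is not implemented here.

```python
import numpy as np

def crs_matmul(A, B, s, rng=None):
    """Unbiased estimate of A @ B from s sampled column/row pairs (plain CRS)."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[1]
    # Sampling probabilities proportional to column/row norms, the standard
    # variance-reducing choice for plain CRS.
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = probs / probs.sum()
    idx = rng.choice(n, size=s, p=probs)
    # Rescale each sampled outer product by 1 / (s * p_k) to keep the
    # estimator unbiased.
    scale = 1.0 / (s * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]
```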
- Dissecting adaptive methods in GANs [46.90376306847234]
We study how adaptive methods help train generative adversarial networks (GANs).
By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.
We prove that in that setting, GANs trained with nSGDA (normalized SGDA) recover all the modes of the true distribution, whereas the same networks trained with SGDA (under any learning rate configuration) suffer from mode collapse.
arXiv Detail & Related papers (2022-10-09T19:00:07Z)
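A rough sketch of the hybrid update described in the entry above: keep the magnitude of the step Adam would take, but move along the normalized SGD (raw gradient) direction. `adam_step` is a placeholder for however that step is obtained, and the normalization is done per parameter tensor for simplicity; this is an illustration, not the paper's exact formulation.

```python
import torch

def hybrid_update(param: torch.Tensor, adam_step: torch.Tensor, eps: float = 1e-12):
    """Move `param` by the Adam step's magnitude along the normalized gradient."""
    if param.grad is None:
        return
    direction = param.grad / (param.grad.norm() + eps)  # normalized SGD direction
    magnitude = adam_step.norm()                        # size of the Adam step
    with torch.no_grad():
        param.sub_(magnitude * direction)
```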
- Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks [14.685237010856953]
We develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks.
We show that learning can be performed with one of two more plausible alternatives to backpropagation.
arXiv Detail & Related papers (2021-09-30T12:56:57Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
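A simplified sketch of the heuristic summarized in the entry above: rescale each layer's initialization by a positive factor and evaluate the loss after one simulated SGD step, which is the quantity the scale factors are chosen to make small. This is an illustration under those assumptions, not the published GradInit procedure; `model`, `loss_fn`, `batch`, and `scales` (one factor per parameter tensor) are placeholders.

```python
import copy
import torch

def post_step_loss(model, loss_fn, batch, scales, lr=0.1):
    """Loss after rescaling each parameter tensor and taking one SGD step."""
    m = copy.deepcopy(model)           # keep the original initialization intact
    with torch.no_grad():
        for p, s in zip(m.parameters(), scales):
            p.mul_(s)                  # per-layer rescaling of the init
    loss = loss_fn(m, batch)
    grads = torch.autograd.grad(loss, list(m.parameters()))
    with torch.no_grad():
        for p, g in zip(m.parameters(), grads):
            p.sub_(lr * g)             # one simulated SGD step
    # The scale factors would be chosen (e.g. by search or gradient descent)
    # to make this post-step loss as small as possible.
    return loss_fn(m, batch)
```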
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
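A minimal sketch of the modification described in the entry above: momentum SGD where each sample in the mini-batch carries its own importance weight. How those weights are computed is the paper's contribution; here they are simply passed in as a tensor.

```python
import torch

def weighted_momentum_step(model, per_sample_losses, weights, buffers,
                           lr=0.1, momentum=0.9):
    """Momentum SGD on a weighted (rather than plain) average of sample losses."""
    loss = (weights * per_sample_losses).sum() / weights.sum()
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            # Persistent momentum buffer per parameter, keyed by name.
            buf = buffers.setdefault(name, torch.zeros_like(p))
            buf.mul_(momentum).add_(p.grad)
            p.sub_(lr * buf)
    return loss.item()
```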
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.