Understanding the robustness difference between stochastic gradient
descent and adaptive gradient methods
- URL: http://arxiv.org/abs/2308.06703v2
- Date: Tue, 28 Nov 2023 22:37:39 GMT
- Title: Understanding the robustness difference between stochastic gradient
descent and adaptive gradient methods
- Authors: Avery Ma, Yangchen Pan and Amir-massoud Farahmand
- Abstract summary: Stochastic gradient descent (SGD) and adaptive gradient methods have been widely used in training deep neural networks.
We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations.
- Score: 11.895321856533934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam
and RMSProp, have been widely used in training deep neural networks. We
empirically show that while the difference between the standard generalization
performance of models trained using these methods is small, those trained using
SGD exhibit far greater robustness under input perturbations. Notably, our
investigation demonstrates the presence of irrelevant frequencies in natural
datasets, where alterations do not affect models' generalization performance.
However, models trained with adaptive methods show sensitivity to these
changes, suggesting that their use of irrelevant frequencies can lead to
solutions sensitive to perturbations. To better understand this difference, we
study the learning dynamics of gradient descent (GD) and sign gradient descent
(signGD) on a synthetic dataset that mirrors natural signals. With a
three-dimensional input space, the models optimized with GD and signGD have
standard risks close to zero but vary in their adversarial risks. Our result
shows that linear models' robustness to $\ell_2$-norm bounded changes is
inversely proportional to the model parameters' weight norm: a smaller weight
norm implies better robustness. In the context of deep learning, our
experiments show that SGD-trained neural networks have smaller Lipschitz
constants, which explains their better robustness to input perturbations
compared with networks trained using adaptive gradient methods.
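The weight-norm argument can be made concrete with a small numerical sketch. The script below is a minimal illustration, not the authors' code: it trains a linear model with GD and with signGD on a toy three-dimensional dataset in the spirit of the paper's synthetic setup, where the label is carried by one high-amplitude coordinate plus two low-amplitude, "irrelevant-frequency-like" coordinates; the data scales, learning rates, and step counts are assumptions. Both runs reach near-zero standard risk, but for a linear model an $\ell_2$ perturbation of size $\varepsilon$ can change the output by at most $\varepsilon\|w\|_2$, so the run with the larger weight norm admits a larger worst-case output change.

```python
# A minimal sketch, not the authors' code: GD vs. signGD on a toy
# three-dimensional linear problem. The label is carried by the first
# coordinate; the other two are low-amplitude copies. Data scales,
# learning rates, and step counts are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
a = np.array([1.0, 0.05, 0.01])            # feature amplitudes (assumption)
X = np.outer(y, a)                          # x_i = y_i * a

def train(direction, lr, steps=2000):
    w = np.zeros(3)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n        # least-squares gradient
        w -= lr * direction(grad)
    return w

w_gd   = train(lambda g: g,          lr=0.05)    # gradient descent
w_sign = train(lambda g: np.sign(g), lr=0.005)   # sign gradient descent

eps = 0.1                                         # l2 perturbation budget
for name, w in [("GD", w_gd), ("signGD", w_sign)]:
    std_risk = np.mean((X @ w - y) ** 2)
    # For a linear model, |w.(x + delta) - w.x| <= eps * ||w||_2 (Cauchy-Schwarz),
    # so a smaller weight norm certifies a smaller worst-case output change.
    print(f"{name:6s} risk={std_risk:.5f}  ||w||={np.linalg.norm(w):.3f}  "
          f"max output change (||delta|| <= {eps}): {eps * np.linalg.norm(w):.3f}")
```

In this toy setup GD stays close to the minimum-norm solution, while signGD places comparable weight on all three coordinates, including the low-amplitude ones, ending with a larger weight norm and hence a larger certified output change.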
Related papers
- Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates [3.6185342807265415]
Deep learning algorithms are the key ingredients in many artificial intelligence (AI) systems.
Deep learning algorithms typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
arXiv Detail & Related papers (2024-07-11T00:10:35Z) - Domain Generalization Guided by Gradient Signal to Noise Ratio of
Parameters [69.24377241408851]
Overfitting to the source domain is a common issue in gradient-based training of deep neural networks.
We propose to base the selection on the gradient-signal-to-noise ratio (GSNR) of the network's parameters.
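As a quick illustration of the quantity involved, here is a sketch under the usual definition of GSNR (the squared mean of the per-sample gradient divided by its variance), not the paper's selection procedure; the synthetic gradients and the threshold are assumptions.

```python
# A rough sketch of per-parameter gradient-signal-to-noise ratio (GSNR):
# squared mean of the per-sample gradient divided by its variance.
# The toy gradients and threshold are assumptions, not the paper's procedure.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 64, 10
# Per-sample gradients of a loss w.r.t. each parameter (synthetic here).
per_sample_grads = rng.normal(loc=0.3, scale=1.0, size=(n_samples, n_params))
per_sample_grads[:, 5:] = rng.normal(loc=0.0, scale=1.0, size=(n_samples, 5))  # noisy params

g_mean = per_sample_grads.mean(axis=0)
g_var = per_sample_grads.var(axis=0) + 1e-12   # small constant avoids division by zero
gsnr = g_mean ** 2 / g_var                     # high GSNR: consistent gradient signal

print("GSNR per parameter:", np.round(gsnr, 3))
print("parameters with GSNR above 0.05:", np.where(gsnr > 0.05)[0])
```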
arXiv Detail & Related papers (2023-10-11T10:21:34Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs can be trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
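To see why an implicit (backward-Euler) gradient step helps stability, here is a minimal sketch on a plain quadratic loss rather than an actual PINN; the matrix, step size, and iteration count are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of explicit vs. implicit gradient steps on a stiff
# quadratic loss L(w) = 0.5 * w^T A w. Values are illustrative assumptions.
import numpy as np

A = np.diag([1.0, 50.0])           # stiff problem: widely spread curvatures
eta = 0.1                          # too large for explicit GD on the stiff direction
w_exp = np.array([1.0, 1.0])
w_imp = np.array([1.0, 1.0])
I = np.eye(2)

for _ in range(50):
    w_exp = w_exp - eta * (A @ w_exp)            # explicit: w_{k+1} = w_k - eta*grad(w_k)
    w_imp = np.linalg.solve(I + eta * A, w_imp)  # implicit: w_{k+1} = w_k - eta*grad(w_{k+1})

print("explicit GD:", w_exp)       # diverges along the stiff coordinate (|1 - eta*50| = 4 > 1)
print("implicit GD:", w_imp)       # contracts in every direction for any eta > 0
```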
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Dissecting adaptive methods in GANs [46.90376306847234]
We study how adaptive methods help train generative adversarial networks (GANs).
By considering an update rule that uses the magnitude of the Adam update together with the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.
We prove that in that setting, GANs trained with nSGDA (normalized SGDA) recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse.
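A rough sketch of the hybrid update described above, under my reading of the summary and not the paper's exact algorithm: keep the per-step magnitude of the Adam update but point it along the normalized plain-gradient (SGD) direction; the hyperparameters are assumptions.

```python
# A hedged sketch: one parameter update mixing Adam's step magnitude with
# SGD's (normalized) direction. Hyperparameters are assumptions.
import numpy as np

def hybrid_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)
    magnitude = np.linalg.norm(adam_step)                  # Adam's step size
    direction = grad / (np.linalg.norm(grad) + 1e-12)      # normalized SGD direction
    return w - magnitude * direction

w = np.array([1.0, -2.0])
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
grad = np.array([0.5, -1.0])                               # illustrative gradient
w = hybrid_step(w, grad, state)
print(w)
```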
arXiv Detail & Related papers (2022-10-09T19:00:07Z) - Scaling Private Deep Learning with Low-Rank and Sparse Gradients [5.14780936727027]
We propose a framework that exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates.
A novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates.
Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.
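As a generic, hedged illustration of the low-rank-plus-sparse idea (not the paper's algorithm or its privacy accounting), the sketch below compresses a gradient matrix into a rank-r factorization plus a top-k residual, so a DP mechanism would only need to perturb a much smaller set of numbers; shapes, rank, and sparsity level are arbitrary assumptions.

```python
# A hedged, generic sketch: rank-r factorization plus top-k sparse residual
# of a gradient matrix. Not the paper's algorithm; values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
# Gradient-like matrix that is approximately low-rank plus sparse (assumption).
G = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
G[rng.integers(0, 64, 30), rng.integers(0, 32, 30)] += 5.0

r, k = 4, 50
U, S, Vt = np.linalg.svd(G, full_matrices=False)
A = U[:, :r] * S[:r]                                 # left factor, shape (64, r)
B = Vt[:r]                                           # right factor, shape (r, 32)
residual = G - A @ B
idx = np.argsort(np.abs(residual), axis=None)[-k:]   # k largest residual entries
sparse_vals = residual.flat[idx]

compact_size = A.size + B.size + k                   # numbers a DP mechanism would noise
G_hat = A @ B
G_hat.flat[idx] += sparse_vals
print("compact size:", compact_size, "vs dense:", G.size)
print("relative reconstruction error:", np.linalg.norm(G_hat - G) / np.linalg.norm(G))
```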
arXiv Detail & Related papers (2022-07-06T14:09:47Z) - Differentially private training of neural networks with Langevin
dynamics for calibrated predictive uncertainty [58.730520380312676]
We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models.
This represents a serious issue for safety-critical applications, e.g. in medical diagnosis.
arXiv Detail & Related papers (2021-07-09T08:14:45Z) - A Distributed Optimisation Framework Combining Natural Gradient with
Hessian-Free for Discriminative Sequence Training [16.83036203524611]
This paper presents a novel natural gradient and Hessian-free (NGHF) optimisation framework for neural network training.
It relies on the linear conjugate gradient (CG) algorithm to combine the natural gradient (NG) method with local curvature information from Hessian-free (HF) or other second-order methods.
Experiments are reported on the multi-genre broadcast data set for a range of different acoustic model types.
arXiv Detail & Related papers (2021-03-12T22:18:34Z) - Inductive Bias of Gradient Descent for Exponentially Weight Normalized
Smooth Homogeneous Neural Nets [1.7259824817932292]
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss.
This paper shows that the gradient flow path with exponential weight normalization (EWN) is equivalent to gradient flow on standard networks with an adaptive learning rate.
arXiv Detail & Related papers (2020-10-24T14:34:56Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past update direction is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error.
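A hedged sketch of an AdaRem-style rule, not the authors' exact update: remember each parameter's recent update direction and enlarge or shrink its step depending on whether the current descent direction agrees with it; the decay, scaling factor, and base rate are assumptions.

```python
# A hedged sketch of per-parameter learning rates driven by direction
# agreement. Not the authors' exact rule; constants are assumptions.
import numpy as np

def adarem_like_step(w, grad, hist, lr=0.01, beta=0.9, scale=0.5):
    agree = np.sign(hist) == np.sign(-grad)           # did the parameter keep moving this way?
    per_param_lr = lr * np.where(agree, 1.0 + scale, 1.0 - scale)
    update = -per_param_lr * grad
    hist = beta * hist + (1 - beta) * update          # remember the direction just taken
    return w + update, hist

w = np.array([1.0, -1.0])
hist = np.zeros(2)
for _ in range(3):
    grad = 2 * w                                      # gradient of ||w||^2 (toy objective)
    w, hist = adarem_like_step(w, grad, hist)
print(w)
```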
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
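A rough sketch of the stated idea, with assumptions throughout: per coordinate and per step, pick the averaging weight from a small candidate set that maximizes the estimated gradient variance (second moment minus squared mean), instead of Adam's fixed decay; this is an illustration of the principle, not the paper's closed-form rule.

```python
# A hedged sketch: per-coordinate moment updates with the variance-maximizing
# decay chosen from a small candidate set. Values are assumptions.
import numpy as np

def maxva_like_moments(m, v, grad, betas=(0.5, 0.9, 0.99)):
    best_m, best_v = m.copy(), v.copy()
    best_var = np.full_like(m, -np.inf)
    for b in betas:
        m_b = b * m + (1 - b) * grad
        v_b = b * v + (1 - b) * grad ** 2
        var_b = v_b - m_b ** 2                       # estimated per-coordinate variance
        take = var_b > best_var
        best_m = np.where(take, m_b, best_m)
        best_v = np.where(take, v_b, best_v)
        best_var = np.maximum(best_var, var_b)
    return best_m, best_v

m, v = np.zeros(2), np.zeros(2)
for grad in (np.array([0.2, 1.0]), np.array([-0.3, 1.1]), np.array([0.25, 0.9])):
    m, v = maxva_like_moments(m, v, grad)
print(m, v)
```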
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.