How do SGD hyperparameters in natural training affect adversarial
robustness?
- URL: http://arxiv.org/abs/2006.11604v1
- Date: Sat, 20 Jun 2020 16:04:44 GMT
- Title: How do SGD hyperparameters in natural training affect adversarial
robustness?
- Authors: Sandesh Kamath, Amit Deshpande, K V Subrahmanyam
- Abstract summary: Learning rate, batch size and momentum are three important hyperparameters in the SGD algorithm.
In this paper, we empirically observe the effect of the SGD hyperparameters on the accuracy and adversarial robustness of networks trained with unperturbed samples.
- Score: 5.406299794900294
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Learning rate, batch size and momentum are three important hyperparameters in
the SGD algorithm. It is known from the work of Jastrzebski et al.
arXiv:1711.04623 that large batch size training of neural networks yields
models which do not generalize well. Yao et al. arXiv:1802.08241 observe that
large batch training yields models that have poor adversarial robustness. In
the same paper, the authors train models with different batch sizes and compute
the eigenvalues of the Hessian of loss function. They observe that as the batch
size increases, the dominant eigenvalues of the Hessian become larger. They
also show that both adversarial training and small-batch training lead to a
drop in the dominant eigenvalues of the Hessian, i.e., a lowering of its
spectrum. They
combine adversarial training and second order information to come up with a new
large-batch training algorithm and obtain robust models with good
generalization. In this paper, we empirically observe the effect of the SGD
hyperparameters on the accuracy and adversarial robustness of networks trained
with unperturbed samples. Jastrzebski et al. considered training models with a
fixed learning rate to batch size ratio. They observed that the higher the
ratio, the better the generalization. We observe that networks trained with a
constant learning rate to batch size ratio, as proposed by Jastrzebski et al.,
yield models that generalize well and also have almost constant adversarial
robustness, independent of the batch size. We also observe that momentum is
more effective when the batch size varies under a fixed learning rate than
under SGD training with a constant learning rate to batch size ratio.
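To make the constant learning-rate-to-batch-size rule concrete, the sketch below scales the SGD learning rate linearly with the batch size so that their ratio stays fixed across a batch-size sweep. This is a minimal PyTorch illustration, not the authors' code; the base ratio (lr 0.1 at batch size 128), the toy model, and the random data are hypothetical choices made only for the example.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Hypothetical base setting: lr = 0.1 at batch size 128, kept as a fixed ratio.
    RATIO = 0.1 / 128
    MOMENTUM = 0.9  # momentum is studied separately in the paper

    def make_optimizer(model, batch_size):
        # Scale the learning rate linearly with the batch size so lr / batch_size stays constant.
        return torch.optim.SGD(model.parameters(), lr=RATIO * batch_size, momentum=MOMENTUM)

    # Toy data and model, purely for illustration.
    x, y = torch.randn(1024, 10), torch.randint(0, 2, (1024,))
    loss_fn = nn.CrossEntropyLoss()

    for batch_size in (64, 128, 256, 512):
        model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
        opt = make_optimizer(model, batch_size)
        loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)
        for xb, yb in loader:  # one pass over the data per batch size, for brevity
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

Under this scheme, models trained at different batch sizes would be compared for both clean accuracy and adversarial robustness, which is the comparison reported in the abstract.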
Related papers
- Reusing Pretrained Models by Multi-linear Operators for Efficient
Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression [34.35440701530876]
We show that for adversarially trained random features models, high overparametrization can hurt robust generalization.
Our theory reveals this nontrivial effect of overparametrization on robustness.
arXiv Detail & Related papers (2022-01-13T18:57:30Z)
- Understanding the Logit Distributions of Adversarially-Trained Deep Neural Networks [6.439477789066243]
Adversarial defenses train deep neural networks to be invariant to the input perturbations from adversarial attacks.
Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood.
We identify three logit characteristics essential to learning adversarial robustness.
arXiv Detail & Related papers (2021-08-26T19:09:15Z)
- Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
arXiv Detail & Related papers (2020-07-28T04:34:43Z)
- On the Generalization Benefit of Noise in Stochastic Gradient Descent [34.127525925676416]
It has long been argued that minibatch gradient descent can generalize better than large batch gradient descent in deep neural networks.
We show that small or moderately large batch sizes can substantially outperform very large batches on the test set.
arXiv Detail & Related papers (2020-06-26T16:18:54Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)