Pyramid Adversarial Training Improves ViT Performance
- URL: http://arxiv.org/abs/2111.15121v1
- Date: Tue, 30 Nov 2021 04:38:14 GMT
- Title: Pyramid Adversarial Training Improves ViT Performance
- Authors: Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang,
Ce Liu, Dilip Krishnan, Deqing Sun
- Abstract summary: Pyramid Adversarial Training is a simple and effective technique to improve ViT's overall performance.
It leads to $1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data.
- Score: 43.322865996422664
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Aggressive data augmentation is a key component of the strong generalization
capabilities of Vision Transformer (ViT). One such data augmentation technique
is adversarial training; however, many prior works have shown that this often
results in poor clean accuracy. In this work, we present Pyramid Adversarial
Training, a simple and effective technique to improve ViT's overall
performance. We pair it with a "matched" Dropout and stochastic depth
regularization, which adopts the same Dropout and stochastic depth
configuration for the clean and adversarial samples. Similar to the
improvements on CNNs by AdvProp (not directly applicable to ViT), our Pyramid
Adversarial Training breaks the trade-off between in-distribution accuracy and
out-of-distribution robustness for ViT and related architectures. It leads to
$1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model
when trained only on ImageNet-1K data, while simultaneously boosting
performance on $7$ ImageNet robustness metrics, by absolute numbers ranging
from $1.76\%$ to $11.45\%$. We set a new state-of-the-art for ImageNet-C (41.4
mCE), ImageNet-R ($53.92\%$), and ImageNet-Sketch ($41.04\%$) without extra
data, using only the ViT-B/16 backbone and our Pyramid Adversarial Training.
Our code will be publicly available upon acceptance.
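A minimal PyTorch sketch of the idea described above, assuming a standard classifier `model(x) -> logits`. The pyramid levels, step size, number of attack steps, and the RNG-replay used to approximate the "matched" Dropout / stochastic-depth configuration are illustrative assumptions rather than the paper's exact recipe; any per-level scaling or clipping of the perturbation is omitted.
```python
import torch
import torch.nn.functional as F

def pyramid_perturbation(deltas, size):
    # Sum the per-level perturbations after upsampling each to full resolution.
    return sum(F.interpolate(d, size=size, mode="bilinear", align_corners=False)
               for d in deltas)

def pyramid_adversarial_examples(model, images, labels,
                                 scales=(32, 16, 1),   # assumed pyramid levels
                                 step_size=1.0, num_steps=5):
    """Craft a multi-scale (pyramid) perturbation that ascends the loss."""
    b, c, h, w = images.shape
    deltas = [torch.zeros(b, c, h // s, w // s, device=images.device,
                          requires_grad=True) for s in scales]
    for _ in range(num_steps):
        adv = images + pyramid_perturbation(deltas, (h, w))
        loss = F.cross_entropy(model(adv), labels)
        grads = torch.autograd.grad(loss, deltas)
        with torch.no_grad():
            for d, g in zip(deltas, grads):
                d += step_size * g.sign()   # signed ascent at every level
    with torch.no_grad():
        return images + pyramid_perturbation(deltas, (h, w))

def clean_plus_adversarial_loss(model, images, labels):
    """Clean + pyramid-adversarial loss with "matched" stochastic regularization:
    the same Dropout / stochastic-depth draws are reused for both forward passes,
    approximated here by replaying the CPU RNG state (an assumption; on GPU the
    CUDA RNG state would need the same treatment)."""
    adv_images = pyramid_adversarial_examples(model, images, labels)
    state = torch.get_rng_state()
    clean_loss = F.cross_entropy(model(images), labels)
    torch.set_rng_state(state)   # replay the same dropout / drop-path masks
    adv_loss = F.cross_entropy(model(adv_images), labels)
    return clean_loss + adv_loss
```
The combined loss would be minimized per mini-batch in place of the usual clean cross-entropy.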
Related papers
- Revisiting Adversarial Training for ImageNet: Architectures, Training
and Generalization across Threat Models [52.86163536826919]
We revisit adversarial training on ImageNet comparing ViTs and ConvNeXts.
Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust generalization across different ranges of model parameters.
Our ViT + ConvStem yields the best generalization to unseen threat models.
arXiv Detail & Related papers (2023-03-03T11:53:01Z)
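For reference on the ConvStem modification mentioned above, a hedged sketch of one common way to replace a ViT's 16x16 patchify projection with a small stack of stride-2 convolutions; the channel widths and normalization choices are assumptions, not the paper's configuration.
```python
import torch.nn as nn

class ConvStem(nn.Module):
    """Convolutional stem producing a ViT token sequence (illustrative sizes)."""
    def __init__(self, in_ch=3, embed_dim=768):
        super().__init__()
        # Four stride-2 convolutions halve the resolution 16x in total
        # (224 -> 14 tokens per side), mimicking a 16x16 patch embedding.
        dims = [in_ch, 64, 128, 256, 512]
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(dims[i], dims[i + 1], 3, stride=2, padding=1),
                       nn.BatchNorm2d(dims[i + 1]),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(dims[-1], embed_dim, 1))  # project to token dim
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # (B, 3, 224, 224)
        x = self.stem(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim) tokens
```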
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of both ViTs and its variants, namely UP-ViTs.
Our method focuses on pruning all ViTs components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
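The summary above is terse, so the following is a generic illustration of structured pruning that preserves a block's external dimensions (and hence the overall model structure); it is not the UP-ViTs algorithm, and the weight-norm importance score is an assumption.
```python
import torch
import torch.nn as nn

def prune_mlp_hidden(fc1: nn.Linear, fc2: nn.Linear, keep_ratio=0.5):
    """Structured pruning of an MLP block (fc1 -> act -> fc2): drop the hidden
    units with the smallest combined weight norm while keeping the block's
    input/output (token) dimensions unchanged."""
    hidden = fc1.out_features
    k = max(1, int(keep_ratio * hidden))
    # Importance of hidden unit i: norm of its incoming plus outgoing weights.
    importance = fc1.weight.norm(dim=1) + fc2.weight.norm(dim=0)
    keep = torch.topk(importance, k).indices.sort().values
    new_fc1 = nn.Linear(fc1.in_features, k, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(k, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```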
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
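As a schematic of the second stage described above, the sketch below trains a Transformer to predict discrete image-token ids produced by a frozen VQ tokenizer; `vq_encoder`, the vocabulary size, and the teacher-forced shift are assumptions, not the paper's implementation.
```python
import torch
import torch.nn.functional as F

def autoregressive_token_loss(transformer, vq_encoder, images, vocab_size=8192):
    """Stage 2: next-token prediction over discrete image codes.
    `vq_encoder(images)` is assumed to return a (B, L) tensor of codebook ids
    from an already-trained, frozen VQ tokenizer."""
    with torch.no_grad():
        tokens = vq_encoder(images)                  # (B, L) integer code ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # teacher forcing
    logits = transformer(inputs)                     # (B, L-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```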
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [127.10054032751714]
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget.
arXiv Detail & Related papers (2021-06-08T17:18:00Z)
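A generic prune-and-regrow mask update of the kind used in dynamic sparse training, as a sketch of the fixed-budget sparse subnetworks mentioned above; the drop fraction and the gradient-based regrowth criterion are assumptions, not the paper's exact procedure.
```python
import torch

def update_mask(weight, mask, grad, regrow_frac=0.1):
    """One prune-and-regrow step: drop the weakest active weights and regrow
    the same number of connections where the dense gradient is largest, so the
    number of active parameters (the budget) stays constant.
    `mask` is a {0,1} tensor applied as `weight * mask` in the forward pass."""
    active = mask.bool()
    k = int(regrow_frac * active.sum().item())
    k = min(k, int((~active).sum().item()))   # cannot regrow more than is free
    if k == 0:
        return mask
    # Prune: the k active weights with the smallest magnitude.
    w = weight.abs().masked_fill(~active, float('inf'))
    drop_idx = torch.topk(w.flatten(), k, largest=False).indices
    # Regrow: the k inactive positions with the largest gradient magnitude.
    g = grad.abs().masked_fill(active, float('-inf'))
    grow_idx = torch.topk(g.flatten(), k, largest=True).indices
    new_mask = mask.clone()
    new_mask.view(-1)[drop_idx] = 0.0
    new_mask.view(-1)[grow_idx] = 1.0
    return new_mask
```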
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
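The summary above does not name a specific optimizer, so the following sharpness-aware minimization (SAM) step is an assumed illustration of acting on loss geometry rather than a statement of the paper's method.
```python
import torch

def sam_step(model, loss_fn, images, labels, base_optimizer, rho=0.05):
    """One sharpness-aware update: perturb the weights toward a nearby
    high-loss point, then descend using the gradient taken there."""
    # First pass: gradient at the current weights.
    loss = loss_fn(model(images), labels)
    loss.backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    grad_norm = torch.norm(torch.stack(
        [g.norm() for g in grads if g is not None]))
    # Climb to the approximate worst case within an L2 ball of radius rho.
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = torch.zeros_like(p) if g is None else rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    # Second pass: gradient at the perturbed weights drives the actual update.
    model.zero_grad()
    loss_fn(model(images), labels).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)              # restore the original weights
    base_optimizer.step()          # apply the sharpness-aware gradient
    model.zero_grad()
    return loss.detach()
```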
- Shape-Texture Debiased Neural Network Training [50.6178024087048]
Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset.
We develop an algorithm for shape-texture debiased learning.
Experiments show that our method successfully improves model performance on several image recognition benchmarks.
arXiv Detail & Related papers (2020-10-12T19:16:12Z)
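One plausible reading of "shape-texture debiased learning" is dual supervision on style-transferred images that carry conflicting shape and texture cues; the sketch below illustrates that loss, with the stylization function `stylize` and the equal weighting left as assumptions.
```python
import torch.nn.functional as F

def debiased_loss(model, content_images, style_images, content_labels,
                  style_labels, stylize, weight=0.5):
    """Illustrative shape-texture debiased supervision: build images whose
    shape comes from `content_images` and whose texture comes from
    `style_images` via a style-transfer function `stylize` (assumed given),
    then supervise with BOTH the shape label and the texture label."""
    mixed = stylize(content_images, style_images)           # conflicting cues
    logits = model(mixed)
    shape_loss = F.cross_entropy(logits, content_labels)    # shape supervision
    texture_loss = F.cross_entropy(logits, style_labels)    # texture supervision
    return weight * shape_loss + (1.0 - weight) * texture_loss
```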