Pyramid Adversarial Training Improves ViT Performance
- URL: http://arxiv.org/abs/2111.15121v1
- Date: Tue, 30 Nov 2021 04:38:14 GMT
- Title: Pyramid Adversarial Training Improves ViT Performance
- Authors: Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang,
Ce Liu, Dilip Krishnan, Deqing Sun
- Abstract summary: Pyramid Adversarial Training is a simple and effective technique to improve ViT's overall performance.
It leads to $1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data.
- Score: 43.322865996422664
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Aggressive data augmentation is a key component of the strong generalization
capabilities of Vision Transformer (ViT). One such data augmentation technique
is adversarial training; however, many prior works have shown that this often
results in poor clean accuracy. In this work, we present Pyramid Adversarial
Training, a simple and effective technique to improve ViT's overall
performance. We pair it with a "matched" Dropout and stochastic depth
regularization, which adopts the same Dropout and stochastic depth
configuration for the clean and adversarial samples. Similar to the
improvements on CNNs by AdvProp (not directly applicable to ViT), our Pyramid
Adversarial Training breaks the trade-off between in-distribution accuracy and
out-of-distribution robustness for ViT and related architectures. It leads to
$1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model
when trained only on ImageNet-1K data, while simultaneously boosting
performance on $7$ ImageNet robustness metrics, by absolute numbers ranging
from $1.76\%$ to $11.45\%$. We set a new state-of-the-art for ImageNet-C (41.4
mCE), ImageNet-R ($53.92\%$), and ImageNet-Sketch ($41.04\%$) without extra
data, using only the ViT-B/16 backbone and our Pyramid Adversarial Training.
Our code will be publicly available upon acceptance.
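A minimal PyTorch sketch of the idea described above, assuming a standard classifier `model(x) -> logits`. The pyramid levels, step size, number of attack steps, and the RNG-replay used to approximate the "matched" Dropout / stochastic-depth configuration are illustrative assumptions rather than the paper's exact recipe; any per-level scaling or clipping of the perturbation is omitted.
```python
import torch
import torch.nn.functional as F

def pyramid_perturbation(deltas, size):
    # Sum the per-level perturbations after upsampling each to full resolution.
    return sum(F.interpolate(d, size=size, mode="bilinear", align_corners=False)
               for d in deltas)

def pyramid_adversarial_examples(model, images, labels,
                                 scales=(32, 16, 1),   # assumed pyramid levels
                                 step_size=1.0, num_steps=5):
    """Craft a multi-scale (pyramid) perturbation that ascends the loss."""
    b, c, h, w = images.shape
    deltas = [torch.zeros(b, c, h // s, w // s, device=images.device,
                          requires_grad=True) for s in scales]
    for _ in range(num_steps):
        adv = images + pyramid_perturbation(deltas, (h, w))
        loss = F.cross_entropy(model(adv), labels)
        grads = torch.autograd.grad(loss, deltas)
        with torch.no_grad():
            for d, g in zip(deltas, grads):
                d += step_size * g.sign()   # signed ascent at every level
    with torch.no_grad():
        return images + pyramid_perturbation(deltas, (h, w))

def clean_plus_adversarial_loss(model, images, labels):
    """Clean + pyramid-adversarial loss with "matched" stochastic regularization:
    the same Dropout / stochastic-depth draws are reused for both forward passes,
    approximated here by replaying the CPU RNG state (an assumption; on GPU the
    CUDA RNG state would need the same treatment)."""
    adv_images = pyramid_adversarial_examples(model, images, labels)
    state = torch.get_rng_state()
    clean_loss = F.cross_entropy(model(images), labels)
    torch.set_rng_state(state)   # replay the same dropout / drop-path masks
    adv_loss = F.cross_entropy(model(adv_images), labels)
    return clean_loss + adv_loss
```
The combined loss would be minimized per mini-batch in place of the usual clean cross-entropy.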
Related papers
- Revisiting Adversarial Training for ImageNet: Architectures, Training
and Generalization across Threat Models [52.86163536826919]
We revisit adversarial training on ImageNet comparing ViTs and ConvNeXts.
Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust generalization across different ranges of model parameters.
Our ViT + ConvStem yields the best generalization to unseen threat models.
arXiv Detail & Related papers (2023-03-03T11:53:01Z)
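For reference on the ConvStem modification mentioned above, a hedged sketch of one common way to replace a ViT's 16x16 patchify projection with a small stack of stride-2 convolutions; the channel widths and normalization choices are assumptions, not the paper's configuration.
```python
import torch.nn as nn

class ConvStem(nn.Module):
    """Convolutional stem producing a ViT token sequence (illustrative sizes)."""
    def __init__(self, in_ch=3, embed_dim=768):
        super().__init__()
        # Four stride-2 convolutions halve the resolution 16x in total
        # (224 -> 14 tokens per side), mimicking a 16x16 patch embedding.
        dims = [in_ch, 64, 128, 256, 512]
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(dims[i], dims[i + 1], 3, stride=2, padding=1),
                       nn.BatchNorm2d(dims[i + 1]),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(dims[-1], embed_dim, 1))  # project to token dim
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # (B, 3, 224, 224)
        x = self.stem(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim) tokens
```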
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of both ViTs and its variants, namely UP-ViTs.
Our method focuses on pruning all ViTs components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
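The summary above is terse, so the following is a generic illustration of structured pruning that preserves a block's external dimensions (and hence the overall model structure); it is not the UP-ViTs algorithm, and the weight-norm importance score is an assumption.
```python
import torch
import torch.nn as nn

def prune_mlp_hidden(fc1: nn.Linear, fc2: nn.Linear, keep_ratio=0.5):
    """Structured pruning of an MLP block (fc1 -> act -> fc2): drop the hidden
    units with the smallest combined weight norm while keeping the block's
    input/output (token) dimensions unchanged."""
    hidden = fc1.out_features
    k = max(1, int(keep_ratio * hidden))
    # Importance of hidden unit i: norm of its incoming plus outgoing weights.
    importance = fc1.weight.norm(dim=1) + fc2.weight.norm(dim=0)
    keep = torch.topk(importance, k).indices.sort().values
    new_fc1 = nn.Linear(fc1.in_features, k, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(k, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```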
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
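As a schematic of the second stage described above, the sketch below trains a Transformer to predict discrete image-token ids produced by a frozen VQ tokenizer; `vq_encoder`, the vocabulary size, and the teacher-forced shift are assumptions, not the paper's implementation.
```python
import torch
import torch.nn.functional as F

def autoregressive_token_loss(transformer, vq_encoder, images, vocab_size=8192):
    """Stage 2: next-token prediction over discrete image codes.
    `vq_encoder(images)` is assumed to return a (B, L) tensor of codebook ids
    from an already-trained, frozen VQ tokenizer."""
    with torch.no_grad():
        tokens = vq_encoder(images)                  # (B, L) integer code ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # teacher forcing
    logits = transformer(inputs)                     # (B, L-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```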
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [127.10054032751714]
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget.
arXiv Detail & Related papers (2021-06-08T17:18:00Z)
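A generic prune-and-regrow mask update of the kind used in dynamic sparse training, as a sketch of the fixed-budget sparse subnetworks mentioned above; the drop fraction and the gradient-based regrowth criterion are assumptions, not the paper's exact procedure.
```python
import torch

def update_mask(weight, mask, grad, regrow_frac=0.1):
    """One prune-and-regrow step: drop the weakest active weights and regrow
    the same number of connections where the dense gradient is largest, so the
    number of active parameters (the budget) stays constant.
    `mask` is a {0,1} tensor applied as `weight * mask` in the forward pass."""
    active = mask.bool()
    k = int(regrow_frac * active.sum().item())
    k = min(k, int((~active).sum().item()))   # cannot regrow more than is free
    if k == 0:
        return mask
    # Prune: the k active weights with the smallest magnitude.
    w = weight.abs().masked_fill(~active, float('inf'))
    drop_idx = torch.topk(w.flatten(), k, largest=False).indices
    # Regrow: the k inactive positions with the largest gradient magnitude.
    g = grad.abs().masked_fill(active, float('-inf'))
    grow_idx = torch.topk(g.flatten(), k, largest=True).indices
    new_mask = mask.clone()
    new_mask.view(-1)[drop_idx] = 0.0
    new_mask.view(-1)[grow_idx] = 1.0
    return new_mask
```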
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
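The summary above does not name a specific optimizer, so the following sharpness-aware minimization (SAM) step is an assumed illustration of acting on loss geometry rather than a statement of the paper's method.
```python
import torch

def sam_step(model, loss_fn, images, labels, base_optimizer, rho=0.05):
    """One sharpness-aware update: perturb the weights toward a nearby
    high-loss point, then descend using the gradient taken there."""
    # First pass: gradient at the current weights.
    loss = loss_fn(model(images), labels)
    loss.backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    grad_norm = torch.norm(torch.stack(
        [g.norm() for g in grads if g is not None]))
    # Climb to the approximate worst case within an L2 ball of radius rho.
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = torch.zeros_like(p) if g is None else rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    # Second pass: gradient at the perturbed weights drives the actual update.
    model.zero_grad()
    loss_fn(model(images), labels).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)              # restore the original weights
    base_optimizer.step()          # apply the sharpness-aware gradient
    model.zero_grad()
    return loss.detach()
```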
- Shape-Texture Debiased Neural Network Training [50.6178024087048]
Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset.
We develop an algorithm for shape-texture debiased learning.
Experiments show that our method successfully improves model performance on several image recognition benchmarks.
arXiv Detail & Related papers (2020-10-12T19:16:12Z)
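One plausible reading of "shape-texture debiased learning" is dual supervision on style-transferred images that carry conflicting shape and texture cues; the sketch below illustrates that loss, with the stylization function `stylize` and the equal weighting left as assumptions.
```python
import torch.nn.functional as F

def debiased_loss(model, content_images, style_images, content_labels,
                  style_labels, stylize, weight=0.5):
    """Illustrative shape-texture debiased supervision: build images whose
    shape comes from `content_images` and whose texture comes from
    `style_images` via a style-transfer function `stylize` (assumed given),
    then supervise with BOTH the shape label and the texture label."""
    mixed = stylize(content_images, style_images)           # conflicting cues
    logits = model(mixed)
    shape_loss = F.cross_entropy(logits, content_labels)    # shape supervision
    texture_loss = F.cross_entropy(logits, style_labels)    # texture supervision
    return weight * shape_loss + (1.0 - weight) * texture_loss
```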