How to augment your ViTs? Consistency loss and StyleAug, a random style
transfer augmentation
- URL: http://arxiv.org/abs/2112.09260v1
- Date: Thu, 16 Dec 2021 23:56:04 GMT
- Title: How to augment your ViTs? Consistency loss and StyleAug, a random style
transfer augmentation
- Authors: Akash Umakantha, Joao D. Semedo, S. Alireza Golestaneh, Wan-Yi S. Lin
- Abstract summary: The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks.
One of the motivations behind ViTs is their weaker inductive biases compared to convolutional neural networks (CNNs).
- Score: 4.3012765978447565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision Transformer (ViT) architecture has recently achieved competitive
performance across a variety of computer vision tasks. One of the motivations
behind ViTs is weaker inductive biases, when compared to convolutional neural
networks (CNNs). However, this also makes ViTs more difficult to train. They
require very large training datasets, heavy regularization, and strong data
augmentations. The data augmentation strategies used to train ViTs have largely
been inherited from CNN training, despite the significant differences between
the two architectures. In this work, we empirically evaluated how different data
augmentation strategies performed on CNN (e.g., ResNet) versus ViT
architectures for image classification. We introduced a style transfer data
augmentation, termed StyleAug, which worked best for training ViTs, while
RandAugment and Augmix typically worked best for training CNNs. We also found
that, in addition to a classification loss, using a consistency loss between
multiple augmentations of the same image was especially helpful when training
ViTs.
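The consistency loss described in the abstract can be illustrated with a short sketch. The following is a minimal, hedged example (assuming PyTorch; the model, the two augmentation callables, and the loss weight are placeholders rather than the authors' exact setup): it combines a cross-entropy classification loss with a symmetric KL-divergence term between the predictions for two independently augmented views of the same batch. One of the views could come from a style-transfer augmentation in the spirit of StyleAug, which would additionally require a pretrained style transfer network not shown here.

```python
# Minimal sketch (assumption: PyTorch; model, augmentations, and the
# consistency weight are illustrative placeholders, not the paper's exact setup).
import torch
import torch.nn.functional as F

def consistency_training_step(model, images, labels, augment_a, augment_b,
                              optimizer, lambda_consistency=1.0):
    """One training step with a classification loss plus a consistency loss.

    augment_a / augment_b: callables producing two augmented views of a batch
    (one of them could be a style-transfer-based augmentation such as StyleAug;
    any batch-wise augmentation works for this sketch).
    """
    view_a, view_b = augment_a(images), augment_b(images)

    logits_a = model(view_a)
    logits_b = model(view_b)

    # Standard classification loss, applied to both augmented views.
    cls_loss = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)

    # Symmetric KL divergence between the two predictive distributions,
    # encouraging the model to be consistent across augmentations.
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    consistency = 0.5 * (
        F.kl_div(log_p_a, log_p_b, log_target=True, reduction="batchmean")
        + F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean")
    )

    loss = cls_loss + lambda_consistency * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```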
Related papers
- Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference (see the expert-averaging sketch after this list).
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against such adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses showing that our method is superior to other methods in that it can capture both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z) - Bootstrapping ViTs: Towards Liberating Vision Transformers from
Pre-training [29.20567759071523]
Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper introduces CNNs' inductive biases back to ViTs while preserving their network architectures for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
arXiv Detail & Related papers (2021-12-07T07:56:50Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Delving Deep into the Generalization of Vision Transformers under
Distribution Shifts [59.93426322225099]
Vision Transformers (ViTs) have achieved impressive results on various vision tasks.
However, their generalization ability under different distribution shifts is rarely understood.
This work provides a comprehensive study on the out-of-distribution generalization of ViTs.
arXiv Detail & Related papers (2021-06-14T17:21:41Z) - Reveal of Vision Transformers Robustness against Adversarial Attacks [13.985121520800215]
This work studies the robustness of ViT variants against different $L_p$-based adversarial attacks in comparison with CNNs.
We provide an analysis that reveals that vanilla ViT or hybrid-ViT are more robust than CNNs.
arXiv Detail & Related papers (2021-06-07T15:59:49Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity (a hedged head-mixing sketch appears after this list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
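As referenced in the Experts Weights Averaging entry above, converting a mixture-of-experts block back into a single FFN can be sketched as plain parameter averaging. This is a hedged illustration only (assuming PyTorch and identically shaped experts); it shows the averaging step, not the paper's full training scheme or routing.

```python
# Minimal sketch (assumption: PyTorch; each expert is an identical two-layer
# MLP, so their parameters can be averaged element-wise into one FFN).
import copy
import torch
import torch.nn as nn

def average_experts_into_ffn(experts: nn.ModuleList) -> nn.Module:
    """Collapse identically shaped expert FFNs into a single FFN whose
    parameters are the element-wise mean of the experts' parameters."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            param.copy_(stacked.mean(dim=0))
    return merged

# Example: four experts, each a standard ViT-style FFN (Linear-GELU-Linear).
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(192, 768), nn.GELU(), nn.Linear(768, 192))
     for _ in range(4)]
)
ffn = average_experts_into_ffn(experts)  # drop-in replacement at inference time
```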
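As referenced in the DeepViT entry above, the core of Re-attention is a learnable matrix that recombines the per-head attention maps before they are applied to the values. The sketch below is a hedged approximation of that idea (assuming PyTorch; normalization and other details from the paper are omitted), not a faithful reimplementation.

```python
# Hedged sketch (assumption: PyTorch). Implements the head-mixing idea only:
# a learnable matrix recombines attention maps across heads to increase diversity.
import torch
import torch.nn as nn

class ReAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable head-mixing matrix applied to the per-head attention maps.
        self.head_mix = nn.Parameter(torch.eye(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)                 # (b, heads, n, n)
        # Recombine attention maps across heads before applying them to V.
        attn = torch.einsum("hg,bgnm->bhnm", self.head_mix, attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```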