Adaptive Attention Link-based Regularization for Vision Transformers
- URL: http://arxiv.org/abs/2211.13852v1
- Date: Fri, 25 Nov 2022 01:26:43 GMT
- Title: Adaptive Attention Link-based Regularization for Vision Transformers
- Authors: Heegon Jin, Jongwon Choi
- Abstract summary: We present a regularization technique to improve the training efficiency of Vision Transformers (ViT).
The trainable links are referred to as the attention augmentation module, which is trained simultaneously with ViT.
We can extract the relevant relationship between each CNN activation map and each ViT attention head, and based on this, we also propose an advanced attention augmentation module.
- Score: 6.6798113365140015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although transformer networks have recently been employed in various
vision tasks with outstanding performance, extensive training data and a lengthy
training time are required to compensate for their lack of an inductive bias. Using
trainable links between the channel-wise spatial attention of a pre-trained
Convolutional Neural Network (CNN) and the attention head of Vision
Transformers (ViT), we present a regularization technique to improve the
training efficiency of ViT. The trainable links are referred to as the
attention augmentation module, which is trained simultaneously with ViT,
boosting the training of ViT and allowing it to avoid the overfitting issue
caused by a lack of data. From the trained attention augmentation module, we
can extract the relevant relationship between each CNN activation map and each
ViT attention head, and based on this, we also propose an advanced attention
augmentation module. Consequently, even with a small amount of data, the
suggested method considerably improves the performance of ViT while achieving
faster convergence during training.
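A minimal PyTorch sketch of the idea described in the abstract: trainable links combine the channel-wise spatial attention maps of a frozen, pre-trained CNN into per-head targets that regularize the ViT attention. The module name, tensor shapes, normalization choices, and KL-based loss below are illustrative assumptions, not the authors' released code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAugmentationSketch(nn.Module):
    """Trainable links from CNN channel attention to ViT attention-head targets (sketch)."""

    def __init__(self, num_cnn_channels: int, num_vit_heads: int):
        super().__init__()
        # One trainable link weight per (CNN channel, ViT head) pair.
        self.links = nn.Linear(num_cnn_channels, num_vit_heads, bias=False)

    def forward(self, cnn_feats: torch.Tensor, vit_attn: torch.Tensor) -> torch.Tensor:
        # cnn_feats: (B, C, H, W) activations from a frozen, pre-trained CNN.
        # vit_attn:  (B, heads, N) per-head attention over N patch tokens,
        #            with N assumed to equal H * W after resizing.
        B, C, H, W = cnn_feats.shape
        # Channel-wise spatial attention: normalize each channel map over space.
        spatial = F.softmax(cnn_feats.flatten(2), dim=-1)          # (B, C, H*W)
        # Linearly combine CNN channels into one target map per ViT head.
        targets = self.links(spatial.transpose(1, 2))              # (B, H*W, heads)
        targets = F.softmax(targets.transpose(1, 2), dim=-1)       # (B, heads, H*W)
        # Auxiliary loss pulling ViT attention toward the linked CNN attention
        # (KL divergence is an illustrative choice, not necessarily the paper's).
        vit_attn = vit_attn / vit_attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        return F.kl_div(vit_attn.clamp_min(1e-8).log(), targets, reduction="batchmean")
```
During training, this auxiliary loss would be added to the ViT objective so that the links and the ViT are optimized jointly, matching the abstract's description of simultaneous training.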
Related papers
- A General and Efficient Training for Transformer via Token Expansion [44.002355107931805]
Vision Transformers (ViTs) typically incur an extremely large training cost.
Existing methods have attempted to accelerate the training of ViTs, yet they typically sacrifice generality or suffer accuracy drops.
We propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs.
arXiv Detail & Related papers (2024-03-31T12:44:24Z) - Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference (see the sketch after this list).
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - A Light Recipe to Train Robust Vision Transformers [34.51642006926379]
We show that Vision Transformers (ViTs) can serve as an underlying architecture for improving the robustness of machine learning models against evasion attacks.
We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset.
We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k.
arXiv Detail & Related papers (2022-09-15T16:00:04Z) - Towards Efficient Adversarial Training on Vision Transformers [41.6396577241957]
Adversarial training is one of the most effective ways to obtain robust CNNs.
We propose an efficient Attention Guided Adversarial Training mechanism.
With only 65% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
arXiv Detail & Related papers (2022-07-21T14:23:50Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [29.20567759071523]
Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper introduces CNNs' inductive biases back into ViTs while preserving their network architectures for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
arXiv Detail & Related papers (2021-12-07T07:56:50Z) - TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation [54.61786380919243]
Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain.
Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations.
With the recent exponential increase in applying Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge remains unexplored in the literature.
arXiv Detail & Related papers (2021-08-12T22:37:43Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg (data augmentation and model regularization), model size, and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z) - Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embeddings of a CNN using anti-aliasing or low-pass filters.
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
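The experts-averaging step described in the Experts Weights Averaging entry above (converting each trained MoE layer back into a single FFN) can be illustrated with a short PyTorch sketch. This is a re-implementation guess under simple assumptions (identically shaped expert FFNs, plain unweighted parameter averaging), not the authors' code.
```python
import torch
import torch.nn as nn

def ffn(dim: int, hidden: int) -> nn.Sequential:
    # Standard ViT feed-forward block; layer layout and shapes are assumed.
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

@torch.no_grad()
def average_experts(experts: list) -> nn.Sequential:
    # Collapse identically shaped expert FFNs into one FFN by parameter averaging.
    dim, hidden = experts[0][0].in_features, experts[0][0].out_features
    merged = ffn(dim, hidden)
    for name, param in merged.named_parameters():
        stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
        param.copy_(stacked.mean(dim=0))
    return merged

# Example: four experts trained in an MoE layer collapse back to one FFN,
# so inference cost matches the original ViT.
experts = [ffn(384, 1536) for _ in range(4)]
single_ffn = average_experts(experts)
```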