Self-Distilled Vision Transformer for Domain Generalization
- URL: http://arxiv.org/abs/2207.12392v1
- Date: Mon, 25 Jul 2022 17:57:05 GMT
- Title: Self-Distilled Vision Transformer for Domain Generalization
- Authors: Maryam Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan,
Fahad Shahbaz Khan
- Abstract summary: Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple DG approach for ViTs, coined as self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
- Score: 58.76055100157651
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the recent past, several domain generalization (DG) methods have
been proposed, showing encouraging performance; however, almost all of them
build on convolutional neural networks (CNNs). There has been little to no
progress on studying the DG performance of vision transformers (ViTs), which
are challenging the supremacy of CNNs on standard benchmarks that are often
built on the i.i.d. assumption. This renders the real-world deployment of ViTs
doubtful. In this paper, we attempt to explore ViTs towards addressing the DG
problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios,
and the main culprit is overfitting to the source domains. Inspired by the
modular architecture of ViTs, we propose a simple DG approach for ViTs, coined
self-distillation for ViTs. It reduces overfitting to the source domains by
easing the learning of the input-output mapping, curating non-zero-entropy
supervisory signals for the intermediate transformer blocks. Further, it does
not introduce any new parameters and can be seamlessly plugged into the modular
composition of different ViTs. We empirically demonstrate notable performance
gains with different DG baselines and various ViT backbones on five challenging
datasets. Moreover, we report favorable performance against recent
state-of-the-art DG methods. Our code, along with pre-trained models, is
publicly available at: https://github.com/maryam089/SDViT
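As a rough, hedged sketch of the self-distillation idea described in the abstract (the exact recipe is in the linked repository), the snippet below passes the class token of a randomly chosen intermediate block through the shared final norm and classifier head and supervises it with the softened, non-zero-entropy predictions of the final block. The timm-style attribute names, temperature, and loss weight are assumptions for illustration, not the authors' settings.
```python
# Hedged sketch of self-distillation for a ViT: an intermediate block's class
# token is supervised with the soft (non-zero entropy) predictions of the final
# block, reusing the same norm and head so no new parameters are added.
# The timm-style attribute names and hyperparameters are assumptions.
import random

import torch
import torch.nn.functional as F


def self_distillation_loss(vit, images, labels, tau=3.0, lam=0.1):
    x = vit.patch_embed(images)                      # (B, N, D) patch tokens
    cls = vit.cls_token.expand(x.shape[0], -1, -1)   # prepend class token
    x = torch.cat((cls, x), dim=1) + vit.pos_embed

    pick = random.randrange(len(vit.blocks) - 1)     # random non-final block
    inter_logits = None
    for i, blk in enumerate(vit.blocks):
        x = blk(x)
        if i == pick:                                # reuse final norm + head
            inter_logits = vit.head(vit.norm(x)[:, 0])
    final_logits = vit.head(vit.norm(x)[:, 0])

    # Standard cross-entropy on the final prediction.
    ce = F.cross_entropy(final_logits, labels)
    # Soft targets from the final block supervise the intermediate block.
    kd = F.kl_div(
        F.log_softmax(inter_logits / tau, dim=-1),
        F.softmax(final_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return ce + lam * kd
```
Because the final head and norm are reused for the intermediate prediction, the sketch adds no new parameters, consistent with the abstract's claim.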
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
However, ViTs are ill-suited for private inference using secure multi-party protocols due to their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
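As a hedged illustration of what "Taylorizing" a nonlinearity can mean in general, the sketch below swaps GELU modules for a polynomial surrogate (the second-order Taylor expansion of GELU around zero); this is a demonstration assumption, not PriViT's actual approximation or selection strategy.
```python
# Illustrative only: replace GELU with a quadratic surrogate so the layer uses
# additions and multiplications only, which are cheap under secure multi-party
# computation. The coefficients are the 2nd-order Taylor expansion of GELU at 0
# and are an assumption for demonstration, not PriViT's method.
import torch.nn as nn


class PolyGELU(nn.Module):
    def forward(self, x):
        # GELU(x) ~= 0.5*x + x**2 / sqrt(2*pi) near zero.
        return 0.5 * x + 0.3989 * x * x


def taylorize(module):
    """Recursively swap every nn.GELU inside `module` for PolyGELU."""
    for name, child in module.named_children():
        if isinstance(child, nn.GELU):
            setattr(module, name, PolyGELU())
        else:
            taylorize(child)
    return module
```
Replacing every nonlinearity this way would hurt accuracy; PriViT's contribution is deciding which nonlinearities to replace while maintaining accuracy, which this sketch does not capture.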
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization of the recent Hyena layer to multiple axes.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- $E(2)$-Equivariant Vision Transformer [11.94180035256023]
Vision Transformer (ViT) has achieved remarkable performance in computer vision.
However, the positional encoding in ViT makes it substantially more difficult to learn the intrinsic equivariance in the data.
We design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator.
arXiv Detail & Related papers (2023-06-11T16:48:03Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, rather than reusing the vanilla, simple MAE of the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness towards common corruptions.
We demonstrate that overlapping patch embedding and convolutional Feed-Forward Networks (FFNs) boost robustness; a sketch of overlapping patch embedding follows below.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
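For context on the overlapping patch embedding mentioned above: it is typically implemented as a strided convolution whose kernel is larger than its stride, so neighboring patches share pixels. The kernel, stride, and embedding dimension below are illustrative assumptions, not values from the paper.
```python
# Hedged sketch: standard (non-overlapping) ViT patch embedding versus an
# overlapping variant. Kernel/stride/dimension choices are illustrative only.
import torch
import torch.nn as nn

# Non-overlapping: kernel_size == stride, so patches do not share pixels.
plain_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

# Overlapping: kernel_size > stride, so neighboring patches share a border.
overlap_embed = nn.Conv2d(3, 768, kernel_size=16, stride=8, padding=4)

x = torch.randn(1, 3, 224, 224)
tokens_plain = plain_embed(x).flatten(2).transpose(1, 2)      # (1, 196, 768)
tokens_overlap = overlap_embed(x).flatten(2).transpose(1, 2)  # (1, 784, 768)
```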
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work [1.6317061277457001]
Vision Transformers (ViTs) are becoming an increasingly popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs).
As a demanding technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships.
We thoroughly compare the performance of various ViT algorithms and most representative CNN methods on popular benchmark datasets.
arXiv Detail & Related papers (2022-03-03T06:17:03Z)
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
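As a hedged sketch of the discrete-token idea above, the snippet below quantizes each continuous patch embedding against a small learned codebook and fuses the nearest codebook entry back in by addition; the codebook size, the additive fusion, and the omission of the straight-through gradient trick are simplifying assumptions, not the paper's design.
```python
# Illustrative vector-quantized input tokens for a ViT. Codebook size and the
# additive fusion are assumptions; the straight-through estimator needed to
# train the codebook end-to-end is omitted for brevity.
import torch
import torch.nn as nn


class DiscreteTokenizer(nn.Module):
    def __init__(self, dim=768, codebook_size=1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_tokens):          # (B, N, dim) continuous tokens
        w = self.codebook.weight              # (K, dim) codebook entries
        # Squared Euclidean distance from every token to every codebook entry.
        dists = (patch_tokens.pow(2).sum(-1, keepdim=True)
                 - 2 * patch_tokens @ w.t()
                 + w.pow(2).sum(-1))          # (B, N, K)
        idx = dists.argmin(dim=-1)            # nearest code per token
        return patch_tokens + self.codebook(idx)  # fuse discrete + continuous


tokenizer = DiscreteTokenizer()
tokens = tokenizer(torch.randn(2, 196, 768))  # fed to the transformer blocks
```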
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [4.961852023598131]
Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs).
This paper studies the behavior and robustness of ViT.
arXiv Detail & Related papers (2021-11-16T12:32:03Z)
- ViTGAN: Training GANs with Vision Transformers [46.769407314698434]
Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases.
We introduce several novel regularization techniques for training GANs with ViTs.
Our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets.
arXiv Detail & Related papers (2021-07-09T17:59:30Z)