Delving Deep into the Generalization of Vision Transformers under
Distribution Shifts
- URL: http://arxiv.org/abs/2106.07617v1
- Date: Mon, 14 Jun 2021 17:21:41 GMT
- Title: Delving Deep into the Generalization of Vision Transformers under
Distribution Shifts
- Authors: Chongzhi Zhang, Mingyuan Zhang, Shanghang Zhang, Daisheng Jin, Qiang
Zhou, Zhongang Cai, Haiyu Zhao, Shuai Yi, Xianglong Liu, Ziwei Liu
- Abstract summary: Vision Transformers (ViTs) have achieved impressive results on various vision tasks.
However, their generalization ability under different distribution shifts is rarely understood.
This work provides a comprehensive study on the out-of-distribution generalization of ViTs.
- Score: 59.93426322225099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vision Transformers (ViTs) have achieved impressive results on
various vision tasks. Yet, their generalization ability under different
distribution shifts is rarely understood. In this work, we provide a
comprehensive study on the out-of-distribution generalization of ViTs. To
support a systematic investigation, we first present a taxonomy of distribution
shifts by categorizing them into five conceptual groups: corruption shift,
background shift, texture shift, destruction shift, and style shift. Then we
perform extensive evaluations of ViT variants under different groups of
distribution shifts and compare their generalization ability with CNNs. Several
important observations are obtained: 1) ViTs generalize better than CNNs under
multiple distribution shifts. With the same or fewer parameters, ViTs are ahead
of corresponding CNNs by more than 5% in top-1 accuracy under most distribution
shifts. 2) Larger ViTs gradually narrow the gap between in-distribution and
out-of-distribution performance. To further improve the generalization of
ViTs, we design the Generalization-Enhanced ViTs by integrating adversarial
learning, information theory, and self-supervised learning. By investigating
three types of generalization-enhanced ViTs, we observe their
gradient-sensitivity and design a smoother learning strategy to achieve a
stable training process. With the modified training schemes, we improve
performance on out-of-distribution data by 4% over vanilla ViTs. We
comprehensively compare three generalization-enhanced ViTs with their
corresponding CNNs and observe that: 1) For the enhanced models, larger ViTs
still benefit more in out-of-distribution generalization. 2)
Generalization-enhanced ViTs are more sensitive to hyper-parameters than their
corresponding CNNs. We hope our comprehensive study can shed light on the
design of more generalizable learning architectures.
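As a concrete illustration of the evaluation protocol described above (and not the paper's exact benchmark), the sketch below compares the top-1 accuracy of a ViT and a CNN on clean images and under a single corruption shift. The model choices (torchvision's ViT-B/16 and ResNet-50), the Gaussian-noise corruption, and the dataset path are assumptions for illustration; the paper itself covers five shift groups at multiple severities.

    # Hedged sketch: compare a ViT and a CNN under a corruption shift.
    # Assumptions: torchvision >= 0.13 (for vit_b_16), an ImageNet-style
    # validation set at "path/to/imagenet/val", and Gaussian noise as a
    # stand-in for the paper's broader set of corruptions.
    import torch
    import torchvision
    from torchvision import transforms

    # ImageNet normalization, applied after corruption so the corruption
    # operates on images in [0, 1].
    IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

    def gaussian_noise(images, std=0.1):
        """A single stand-in corruption: additive Gaussian noise, clamped to [0, 1]."""
        return (images + std * torch.randn_like(images)).clamp(0.0, 1.0)

    @torch.no_grad()
    def top1_accuracy(model, loader, corrupt=None, device="cuda"):
        """Top-1 accuracy of `model` on `loader`, optionally corrupting each batch."""
        model.eval().to(device)
        mean, std = IMAGENET_MEAN.to(device), IMAGENET_STD.to(device)
        correct, total = 0, 0
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            if corrupt is not None:
                images = corrupt(images)
            logits = model((images - mean) / std)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        return correct / total

    if __name__ == "__main__":
        # Assumed model pair: an ImageNet-pretrained ViT-B/16 and ResNet-50.
        models = {
            "ViT-B/16": torchvision.models.vit_b_16(weights="IMAGENET1K_V1"),
            "ResNet-50": torchvision.models.resnet50(weights="IMAGENET1K_V1"),
        }
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),  # keep images in [0, 1]; normalize later
        ])
        val_set = torchvision.datasets.ImageFolder("path/to/imagenet/val",
                                                   transform=preprocess)
        loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=4)
        for name, model in models.items():
            clean = top1_accuracy(model, loader)
            shifted = top1_accuracy(model, loader, corrupt=gaussian_noise)
            print(f"{name}: clean top-1 = {clean:.3f}, corrupted top-1 = {shifted:.3f}")

Sweeping this loop over additional corruption types, severities, and the other shift groups yields per-shift top-1 accuracies of the kind summarized in the abstract.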
Related papers
- Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple domain generalization (DG) approach for ViTs, coined self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
arXiv Detail & Related papers (2022-07-25T17:57:05Z) - SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViTs) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViTs.
We observe that relational modeling on spatial and channel dimensions distinguishes ViTs from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks built on ViTs and is the first to achieve higher performance than the CNN state of the art.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - The Principle of Diversity: Training Stronger Vision Transformers Calls
for Reducing All Levels of Redundancy [111.49944789602884]
This paper systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space.
We propose corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the capture of more discriminative information.
arXiv Detail & Related papers (2022-03-12T04:48:12Z) - How to augment your ViTs? Consistency loss and StyleAug, a random style
transfer augmentation [4.3012765978447565]
The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks.
One of the motivations behind ViTs is their weaker inductive biases compared to convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-12-16T23:56:04Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Across various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.