The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
- URL: http://arxiv.org/abs/2203.06345v1
- Date: Sat, 12 Mar 2022 04:48:12 GMT
- Title: The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
- Authors: Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang
- Abstract summary: This paper systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space.
We propose corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the capture of more discriminative information.
- Score: 111.49944789602884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have gained increasing popularity as they are
commonly believed to offer higher modeling capacity and representation
flexibility than traditional convolutional networks. However, it is
questionable whether such potential has been fully unleashed in practice, as
the learned ViTs often suffer from over-smoothing, yielding likely redundant
models. Recent works made preliminary attempts to identify and alleviate such
redundancy, e.g., via regularizing embedding similarity or re-injecting
convolution-like structures. However, a "head-to-toe assessment" regarding the
extent of redundancy in ViTs, and how much we could gain by thoroughly
mitigating it, has been absent from this field. This paper, for the first
time, systematically studies the ubiquitous existence of redundancy at all
three levels: patch embedding, attention map, and weight space. In view of
them, we advocate a principle of diversity for training ViTs, by presenting
corresponding regularizers that encourage the representation diversity and
coverage at each of those levels, enabling the capture of more discriminative
information. Extensive experiments on ImageNet with a number of ViT backbones
validate the effectiveness of our proposals, largely eliminating the observed
ViT redundancy and significantly boosting the model generalization. For
example, our diversified DeiT obtains 0.70%~1.76% accuracy boosts on ImageNet
with highly reduced similarity. Our code is available at
https://github.com/VITA-Group/Diverse-ViT.
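
Below is a minimal sketch, in PyTorch, of what a diversity regularizer at the patch-embedding level might look like. It only illustrates the idea of penalizing pairwise token similarity and is not the authors' implementation (which also covers the attention-map and weight-space levels; see the repository above). The function name patch_diversity_penalty, the tensor shapes, and the 0.1 loss coefficient are assumptions made for this example.

    # A minimal sketch of a patch-embedding diversity regularizer, assuming a
    # PyTorch ViT whose blocks expose token tensors of shape [batch, tokens, dim].
    # Not the authors' exact code; it only illustrates penalizing pairwise
    # token similarity so training discourages redundant representations.
    import torch
    import torch.nn.functional as F


    def patch_diversity_penalty(tokens: torch.Tensor) -> torch.Tensor:
        """Mean off-diagonal cosine similarity among patch tokens.

        tokens: [B, N, D] patch embeddings taken from one transformer block.
        Adding this value (scaled by a small coefficient) to the task loss
        discourages over-smoothed, near-duplicate token representations.
        """
        t = F.normalize(tokens, dim=-1)                   # unit-norm each token
        sim = torch.bmm(t, t.transpose(1, 2))             # [B, N, N] cosine similarities
        n = sim.size(-1)
        off_diag = sim - torch.eye(n, device=sim.device)  # zero out self-similarity
        per_sample = off_diag.abs().sum(dim=(1, 2)) / (n * (n - 1))
        return per_sample.mean()


    # Hypothetical usage in a training step (the 0.1 coefficient is illustrative):
    # loss = criterion(logits, labels) + 0.1 * patch_diversity_penalty(block_tokens)
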
Related papers
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization of the recent Hyena layer to multiple axes.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs)
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN significantly surpasses other few-shot learning frameworks with ViTs and is the first to outperform CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Delving Deep into the Generalization of Vision Transformers under Distribution Shifts [59.93426322225099]
Vision Transformers (ViTs) have achieved impressive results on various vision tasks.
However, their generalization ability under different distribution shifts remains poorly understood.
This work provides a comprehensive study on the out-of-distribution generalization of ViTs.
arXiv Detail & Related papers (2021-06-14T17:21:41Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.