Augmented Shortcuts for Vision Transformers
- URL: http://arxiv.org/abs/2106.15941v1
- Date: Wed, 30 Jun 2021 09:48:30 GMT
- Title: Augmented Shortcuts for Vision Transformers
- Authors: Yehui Tang, Kai Han, Chang Xu, An Xiao, Yiping Deng, Chao Xu, Yunhe
Wang
- Abstract summary: We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel with the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
- Score: 49.70151144700589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have recently achieved great progress on computer vision tasks. The rapid development of vision transformers is mainly attributed to their strong ability to extract informative features from input images. However, mainstream transformer models are designed with deep architectures, and feature diversity is continuously reduced as the depth increases, a phenomenon known as feature collapse. In this paper, we theoretically analyze the feature collapse phenomenon and study the relationship between shortcuts and feature diversity in these transformer models. We then present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel with the original shortcuts. To save computational cost, we further explore an efficient approach that uses block-circulant projection to implement the augmented shortcuts. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method, which brings about a 1% accuracy increase for state-of-the-art vision transformers without noticeably increasing their parameters or FLOPs.
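As a rough illustration of the scheme, the minimal NumPy sketch below adds learnable paths in parallel with the identity shortcut, each consisting of a projection followed by a nonlinearity; the projection is restricted to a single circulant matrix applied via the FFT as a cheap stand-in for the paper's block-circulant projection. Function names, shapes, and the tanh activation are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is `c` by vector `x`.

    A circulant matrix is diagonalized by the DFT, so the product reduces to
    a circular convolution computed in O(d log d) with the FFT.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def augmented_shortcut(z, paths, act=np.tanh):
    """Identity shortcut plus several learnable paths applied in parallel.

    z     : (n_tokens, d) token features entering a block
    paths : list of (d,) parameter vectors, each defining one circulant
            projection (a cheap stand-in for a dense d x d matrix)
    """
    out = z.copy()                                  # original shortcut: identity
    for c in paths:                                 # each augmented path
        proj = np.stack([circulant_matvec(c, row) for row in z])
        out = out + act(proj)                       # projection + nonlinearity
    return out

# Toy usage: 4 tokens of dimension 8, two augmented paths.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
paths = [0.1 * rng.standard_normal(8) for _ in range(2)]
print(augmented_shortcut(tokens, paths).shape)      # -> (4, 8)
```

In a full model, such paths would sit alongside every existing shortcut and be trained jointly with the rest of the network; the block-circulant structure described in the paper generalizes the single circulant used here.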
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
The cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Depth Estimation with Simplified Transformer [4.565830918989131]
Transformer and its variants have shown state-of-the-art results in many vision tasks recently.
We propose a method for self-supervised monocular Depth Estimation with a Simplified Transformer (DEST).
Our model significantly reduces model size, complexity, and inference latency, while achieving accuracy superior to the state of the art.
arXiv Detail & Related papers (2022-04-28T21:39:00Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to represent high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue.
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Visual Transformer Pruning [44.43429237788078]
We present a visual transformer pruning approach, which identifies the impact of channels in each layer and then prunes accordingly.
The pipeline for visual transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning channels; 3) finetuning (a rough sketch of the pruning step follows this list).
The parameter and FLOPs reduction ratios of the proposed algorithm are evaluated and analyzed on the ImageNet dataset to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-04-17T09:49:24Z)
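For the Visual Transformer Pruning entry above, the following hypothetical sketch illustrates step 2 of its pipeline, under the assumption that channel importance is read off gate values learned with L1 sparsity regularization in step 1; the function name, keep ratio, and shapes are illustrative, not the paper's implementation, and the slimmed weights would then be finetuned (step 3).

```python
import numpy as np

def prune_channels(weight, gates, keep_ratio=0.5):
    """Step 2 of the pipeline: keep the channels whose learned gate
    (trained with L1 sparsity regularization in step 1) scores highest.

    weight : (d_out, d_in) projection matrix of a transformer layer
    gates  : (d_in,) channel-importance scores
    """
    n_keep = max(1, int(keep_ratio * gates.size))
    keep = np.sort(np.argsort(-np.abs(gates))[:n_keep])  # most important channels
    return weight[:, keep], keep

# Toy usage: prune half of 8 input channels of a random layer;
# the reduced model would then be finetuned (step 3).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
gates = rng.standard_normal(8)
W_pruned, kept = prune_channels(W, gates, keep_ratio=0.5)
print(W_pruned.shape, kept)                               # -> (16, 4) and kept indices
```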
This list is automatically generated from the titles and abstracts of the papers on this site.