Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision
Transformers
- URL: http://arxiv.org/abs/2310.05642v1
- Date: Mon, 9 Oct 2023 11:56:35 GMT
- Title: Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision
Transformers
- Authors: Xuwei Xu, Sen Wang, Yudong Chen, Jiajun Liu
- Abstract summary: Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks.
High computational complexity hinders ViTs' applicability on devices with limited memory and computing resources.
We propose a novel channel shuffle module to improve tiny-size ViTs.
- Score: 15.108494142240993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance in
various computer vision tasks. However, the high computational complexity
hinders ViTs' applicability on devices with limited memory and computing
resources. Although certain investigations have delved into the fusion of
convolutional layers with self-attention mechanisms to enhance the efficiency
of ViTs, there remains a knowledge gap in constructing tiny yet effective ViTs
solely based on the self-attention mechanism. Furthermore, the straightforward
strategy of reducing the feature channels in a large but outperforming ViT
often results in significant performance degradation despite improved
efficiency. To address these challenges, we propose a novel channel shuffle
module to improve tiny-size ViTs, showing the potential of pure self-attention
models in environments with constrained computing resources. Inspired by the
channel shuffle design in ShuffleNetV2 (Ma et al., 2018), our module
expands the feature channels of a tiny ViT and partitions the channels into two
groups: the Attended and Idle groups. Self-attention
computations are exclusively employed on the designated Attended
group, followed by a channel shuffle operation that facilitates information
exchange between the two groups. By incorporating our module into a tiny ViT,
we can achieve superior performance while maintaining a comparable
computational complexity to the vanilla model. Specifically, our proposed
channel shuffle module consistently improves the top-1 accuracy on the
ImageNet-1K dataset for various tiny ViT models by up to 2.8\%, with the
changes in model complexity being less than 0.03 GMACs.
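As a rough illustration of the design described above, the following is a minimal PyTorch-style sketch; the names (ChannelShuffleBlock, expand_ratio), the two-way channel split, the residual connection, and the interleaving shuffle are assumptions based only on the abstract, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ChannelShuffleBlock(nn.Module):
        """Sketch: expand channels, attend only to the 'Attended' group, then shuffle."""

        def __init__(self, dim: int, num_heads: int = 4, expand_ratio: int = 2):
            super().__init__()
            self.expanded_dim = dim * expand_ratio           # widened feature channels
            self.attended_dim = self.expanded_dim // 2       # half of them see attention
            self.expand = nn.Linear(dim, self.expanded_dim)  # hypothetical expansion layer
            self.norm = nn.LayerNorm(self.attended_dim)
            # attended_dim must be divisible by num_heads
            self.attn = nn.MultiheadAttention(self.attended_dim, num_heads, batch_first=True)
            self.reduce = nn.Linear(self.expanded_dim, dim)  # project back to the original width

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, tokens, dim)
            z = self.expand(x)
            attended, idle = z.split(self.attended_dim, dim=-1)  # Attended / Idle channel groups
            a = self.norm(attended)
            a, _ = self.attn(a, a, a)        # self-attention on the Attended group only
            attended = attended + a          # residual connection on the Attended group
            # Channel shuffle: interleave the two groups so information is exchanged.
            b, n, _ = attended.shape
            z = torch.stack((attended, idle), dim=-1).reshape(b, n, self.expanded_dim)
            return self.reduce(z)

Because self-attention runs on only the Attended half of the expanded channels, the quadratic attention cost stays close to that of the unexpanded block, which is consistent with the abstract's point that the change in complexity is small.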
Related papers
- Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
Visual perception tasks are predominantly solved by ViT architectures.
Despite their effectiveness, ViTs encounter a computational bottleneck due to the complexity of computing self-attention.
We propose the Fibottention architecture, which is built upon approximating self-attention.
arXiv Detail & Related papers (2024-06-27T17:59:40Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves a series of state-of-the-art results with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become the popular structures and outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps and increase their diversity; a hedged sketch of this idea follows below.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
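For context on the Re-attention mechanism mentioned in the DeepViT entry, here is a hedged sketch of the general idea: the per-head attention maps are recombined with a small learnable head-mixing matrix before being applied to the values. The shapes, the identity initialization of the mixing matrix, and the omission of DeepViT's normalization step are assumptions for illustration, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class ReAttention(nn.Module):
        """Sketch: mix attention maps across heads to increase their diversity."""

        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.mix = nn.Parameter(torch.eye(num_heads))  # learnable head-to-head mixing matrix
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, n, d = x.shape
            qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
            q, k, v = qkv[0], qkv[1], qkv[2]              # each: (batch, heads, tokens, head_dim)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            # Re-attention step: linearly recombine the per-head attention maps.
            attn = torch.einsum('hg,bgnm->bhnm', self.mix, attn)
            out = (attn @ v).transpose(1, 2).reshape(b, n, d)
            return self.proj(out)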
This list is automatically generated from the titles and abstracts of the papers on this site.