SepViT: Separable Vision Transformer
- URL: http://arxiv.org/abs/2203.15380v4
- Date: Thu, 15 Jun 2023 16:37:26 GMT
- Title: SepViT: Separable Vision Transformer
- Authors: Wei Li, Xing Wang, Xin Xia, Jie Wu, Jiashi Li, Xuefeng Xiao, Min
Zheng, Shiping Wen
- Abstract summary: Vision Transformers often incur extensive computational costs to achieve high performance, which makes them burdensome to deploy on resource-constrained devices.
We draw lessons from depthwise separable convolution and imitate its design philosophy to build an efficient Transformer backbone, i.e., the Separable Vision Transformer, abbreviated as SepViT.
SepViT carries out local-global information interaction within and among windows, in sequential order, via depthwise separable self-attention.
- Score: 20.403430632658946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have witnessed prevailing success in a series of vision
tasks. However, these Transformers often incur extensive computational costs to
achieve high performance, which makes them burdensome to deploy on
resource-constrained devices. To alleviate this issue, we draw lessons from
depthwise separable convolution and imitate its design philosophy to build an
efficient Transformer backbone, i.e., the Separable Vision Transformer,
abbreviated as SepViT. SepViT carries out local-global information interaction
within and among windows, in sequential order, via depthwise separable
self-attention. The novel window token embedding and grouped self-attention are
employed to compute the attention relationship among windows with negligible
cost and establish long-range visual interactions across multiple windows,
respectively. Extensive experiments on general-purpose vision benchmarks
demonstrate that SepViT can achieve a state-of-the-art trade-off between
performance and latency. In particular, SepViT achieves 84.2% top-1 accuracy on
ImageNet-1K classification while reducing latency by 40% compared to models with
similar accuracy (e.g., CSWin). Furthermore, SepViT achieves 51.0% mIoU on the
ADE20K semantic segmentation task, 47.9 AP on the RetinaNet-based COCO
detection task, 49.4 box AP and 44.6 mask AP on Mask R-CNN-based COCO object
detection and instance segmentation tasks.
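
The abstract describes a two-step attention scheme analogous to depthwise separable convolution: a "depthwise" self-attention inside each local window (with a learnable window token summarizing the window), followed by a "pointwise" self-attention among the window tokens to exchange information across windows at negligible cost. The sketch below illustrates that idea only; the single attention head, the additive fusion of the globally mixed window token back into its window, and all class and parameter names are simplifying assumptions for illustration, not the paper's exact formulation (which also includes grouped self-attention across neighboring windows).

```python
# Minimal PyTorch sketch of window-level ("depthwise") self-attention with a
# learnable window token, followed by attention among window tokens
# ("pointwise") -- an illustration of the idea in the abstract, not the
# official SepViT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparableSelfAttention(nn.Module):
    def __init__(self, dim: int, window_size: int = 7):
        super().__init__()
        self.ws = window_size
        self.qkv = nn.Linear(dim, dim * 3)       # QKV for tokens inside a window
        self.win_qkv = nn.Linear(dim, dim * 3)   # QKV for the window tokens
        self.proj = nn.Linear(dim, dim)
        self.window_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable window token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W are assumed divisible by the window size
        B, H, W, C = x.shape
        ws, nh, nw = self.ws, H // self.ws, W // self.ws

        # window partition -> (B * num_windows, ws*ws, C)
        x = x.view(B, nh, ws, nw, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * nh * nw, ws * ws, C)

        # "depthwise" step: self-attention inside each window, with a window
        # token prepended so it learns a summary of that window
        wt = self.window_token.expand(x.size(0), -1, -1)
        x = torch.cat([wt, x], dim=1)                              # (B*nW, 1+ws*ws, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        x = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1) @ v
        wt, x = x[:, :1], x[:, 1:]                                 # split token / patches

        # "pointwise" step: attention among window tokens only, which is cheap
        # because there is a single token per window
        wt = wt.reshape(B, nh * nw, C)
        q, k, v = self.win_qkv(wt).chunk(3, dim=-1)
        wt = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1) @ v

        # fuse the globally informed window token back into its window
        # (assumption: simple additive fusion for illustration)
        x = x.reshape(B, nh * nw, ws * ws, C) + wt.unsqueeze(2)
        x = self.proj(x)

        # window reverse -> (B, H, W, C)
        x = x.reshape(B, nh, nw, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


if __name__ == "__main__":
    feats = torch.randn(2, 56, 56, 96)            # toy 56x56 feature map, 96 channels
    out = DepthwiseSeparableSelfAttention(dim=96, window_size=7)(feats)
    print(out.shape)                              # torch.Size([2, 56, 56, 96])
```

The efficiency argument mirrors depthwise separable convolution: with N tokens split into N/w^2 windows of w x w tokens, within-window attention costs O(N * w^2 * C) and the window-token attention costs O((N/w^2)^2 * C), both far below the O(N^2 * C) of global self-attention for typical window sizes.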
Related papers
- HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [102.4965532024391]
Hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades prevalent four-stage ViT to five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448×448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224×224 inputs.
arXiv Detail & Related papers (2024-03-18T17:34:29Z) - DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z) - Next-ViT: Next Generation Vision Transformer for Efficient Deployment in
Realistic Industrial Scenarios [19.94294348122248]
Most Vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios.
We propose a next-generation Vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT.
Next-ViT dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off.
arXiv Detail & Related papers (2022-07-12T12:50:34Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - Iwin: Human-Object Interaction Detection via Transformer with Irregular
Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z) - What Makes for Hierarchical Vision Transformer? [46.848348453909495]
We replace self-attention layers in Swin Transformer and Shuffle Transformer with simple linear mapping and keep other components unchanged.
The resulting architecture with 25.4M parameters and 4.2G FLOPs achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs.
arXiv Detail & Related papers (2021-07-05T17:59:35Z) - CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z) - Focal Self-attention for Local-Global Interactions in Vision
Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)