Separable Self-attention for Mobile Vision Transformers
- URL: http://arxiv.org/abs/2206.02680v1
- Date: Mon, 6 Jun 2022 15:31:35 GMT
- Title: Separable Self-attention for Mobile Vision Transformers
- Authors: Sachin Mehta and Mohammad Rastegari
- Abstract summary: This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$.
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
- Score: 34.32399598443582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mobile vision transformers (MobileViT) can achieve state-of-the-art
performance across several mobile vision tasks, including classification and
detection. Though these models have fewer parameters, they have high latency as
compared to convolutional neural network-based models. The main efficiency
bottleneck in MobileViT is the multi-headed self-attention (MHA) in
transformers, which requires $O(k^2)$ time complexity with respect to the
number of tokens (or patches) $k$. Moreover, MHA requires costly operations
(e.g., batch-wise matrix multiplication) for computing self-attention,
impacting latency on resource-constrained devices. This paper introduces a
separable self-attention method with linear complexity, i.e. $O(k)$. A simple
yet effective characteristic of the proposed method is that it uses
element-wise operations for computing self-attention, making it a good choice
for resource-constrained devices. The improved model, MobileViTv2, is
state-of-the-art on several mobile vision tasks, including ImageNet object
classification and MS-COCO object detection. With about three million
parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet
dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster
on a mobile device.
Our source code is available at: https://github.com/apple/ml-cvnets
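To make the element-wise idea concrete, below is a minimal PyTorch sketch of a linear-complexity separable self-attention layer in the spirit of the abstract: each token gets a single context score, the scores build one global context vector, and that vector interacts with the tokens through broadcast element-wise multiplication, so the cost grows as $O(k)$ rather than $O(k^2)$. Module and variable names here are illustrative assumptions, not the authors' ml-cvnets implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    """Sketch of O(k) separable self-attention via element-wise operations."""

    def __init__(self, dim: int):
        super().__init__()
        # One fused projection: 1 context score per token + `dim` key channels + `dim` value channels.
        self.qkv = nn.Linear(dim, 1 + 2 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k tokens, dim)
        dim = x.size(-1)
        scores, key, value = torch.split(self.qkv(x), [1, dim, dim], dim=-1)
        ctx_weights = F.softmax(scores, dim=1)                   # (B, k, 1): softmax over tokens, O(k)
        context = (ctx_weights * key).sum(dim=1, keepdim=True)   # (B, 1, dim): one global context vector
        out = F.relu(value) * context                            # broadcast element-wise interaction, O(k)
        return self.out(out)

# Example: 256 tokens of width 64 -> same-shaped output.
tokens = torch.randn(2, 256, 64)
y = SeparableSelfAttention(64)(tokens)   # y.shape == (2, 256, 64)
```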
Related papers
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks thanks to the powerful global-context modeling capability of their token mixers.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications.
We build a series of models called "SwiftFormer" which achieve state-of-the-art performance in terms of both accuracy and mobile inference speed (see the additive-attention sketch after this list).
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on an iPhone 14, which is more accurate and 2x faster than MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
- MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer blocks into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z)
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [111.8342799044698]
We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes Tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
arXiv Detail & Related papers (2022-04-12T04:51:42Z)
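For comparison, the additive attention referenced in the SwiftFormer entry above can be sketched in the same spirit: a learned scoring vector pools the queries into one global query, which then interacts element-wise with the keys, again avoiding the k x k attention matrix. This is a rough illustrative reading of that abstract; the module layout, residual wiring, and names below are assumptions, not the official SwiftFormer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttentionSketch(nn.Module):
    """Rough sketch of additive, element-wise attention with linear cost in the token count."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.score = nn.Parameter(torch.randn(dim))  # learned vector scoring each query token (assumed)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k tokens, dim)
        q, k = self.to_q(x), self.to_k(x)
        alpha = F.softmax(q @ self.score / q.size(-1) ** 0.5, dim=1)  # (B, k): per-token weights
        global_q = torch.einsum("bk,bkd->bd", alpha, q)               # (B, dim): pooled global query
        out = q + k * global_q.unsqueeze(1)                           # element-wise interaction, O(k)
        return self.proj(out)

# Example: y = AdditiveAttentionSketch(64)(torch.randn(2, 196, 64))
```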