MoCoViT: Mobile Convolutional Vision Transformer
- URL: http://arxiv.org/abs/2205.12635v2
- Date: Thu, 26 May 2022 13:40:26 GMT
- Title: MoCoViT: Mobile Convolutional Vision Transformer
- Authors: Hailong Ma, Xin Xia, Xing Wang, Xuefeng Xiao, Jiashi Li, Min Zheng
- Abstract summary: We present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformers into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformer architectures on various vision tasks.
- Score: 13.233314183471213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer networks have achieved impressive results on a variety
of vision tasks. However, most of them are computationally expensive and not
suitable for real-world mobile applications. In this work, we present Mobile
Convolutional Vision Transformer (MoCoViT), which improves performance and
efficiency by introducing transformers into mobile convolutional networks to
leverage the benefits of both architectures. Different from recent works on
vision transformer, the mobile transformer block in MoCoViT is carefully
designed for mobile devices and is very lightweight, accomplished through two
primary modifications: the Mobile Self-Attention (MoSA) module and the Mobile
Feed Forward Network (MoFFN). MoSA simplifies the calculation of the attention
map through a Branch Sharing scheme, while MoFFN serves as a mobile version of
the MLP in the transformer, further reducing the computation by a large margin.
Comprehensive experiments verify that our proposed MoCoViT family outperforms
state-of-the-art portable CNNs and transformer-based architectures on various
vision tasks. On ImageNet classification, it achieves 74.5% top-1 accuracy at
147M FLOPs, a gain of 1.2% over MobileNetV3 with less computation. On the COCO
object detection task, MoCoViT outperforms GhostNet by 2.1 AP in the RetinaNet
framework.
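The abstract names MoSA and MoFFN but gives no implementation details, so the sketch below is only one possible reading, not the authors' implementation: MoSA is rendered with a single projection shared across the query/key/value branches (one reading of "Branch Sharing"), and MoFFN as a Ghost-module-style MLP in which half of the expanded features come from a cheap per-channel operation. The class layouts, expansion ratio, and cheap branch are all assumptions; residual connections and normalization are omitted for brevity.

```python
# Hedged PyTorch sketch of the two modules named in the abstract. The paper's
# actual layer layout is not given here, so every design choice below
# (shared projection, Ghost-style cheap branch, expansion ratio) is an assumption.
import torch
import torch.nn as nn


class MoSA(nn.Module):
    """Mobile Self-Attention, assumed form: query, key and value reuse one
    shared projection ("branch sharing"), so only two linear layers remain."""

    def __init__(self, dim: int):
        super().__init__()
        self.shared = nn.Linear(dim, dim)  # single projection reused for Q, K and V (assumption)
        self.proj = nn.Linear(dim, dim)    # output projection
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        qkv = self.shared(x)
        attn = (qkv @ qkv.transpose(-2, -1)) * self.scale  # attention map from the shared branch
        return self.proj(attn.softmax(dim=-1) @ qkv)


class MoFFN(nn.Module):
    """Mobile FFN, assumed form: a Ghost-style MLP where half of the expanded
    features are produced by a cheap per-channel (grouped) operation."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion // 2
        self.primary = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(  # cheap branch: grouped 1x1 conv over channels (assumption)
            nn.Conv1d(hidden, hidden, kernel_size=1, groups=hidden), nn.ReLU(inplace=True)
        )
        self.out = nn.Linear(hidden * 2, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        p = self.primary(x)                                # (B, N, hidden)
        c = self.cheap(p.transpose(1, 2)).transpose(1, 2)  # cheap features derived from p
        return self.out(torch.cat([p, c], dim=-1))


if __name__ == "__main__":
    x = torch.randn(1, 196, 96)          # 14x14 tokens, 96 channels
    y = MoFFN(96)(MoSA(96)(x))           # one mobile-transformer-style block, roughly
    print(y.shape)                       # torch.Size([1, 196, 96])
```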
Related papers
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks, with token mixers that provide powerful global context modeling.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z) - RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$ (a rough sketch of the idea appears after this list).
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
arXiv Detail & Related papers (2022-06-06T15:31:35Z) - MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z) - Mobile-Former: Bridging MobileNet and Transformer [42.60008028063716]
We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between.
Mobile-Former is not only computationally efficient, but also has more representation power, outperforming MobileNetV3 at low FLOP regime.
arXiv Detail & Related papers (2021-08-12T17:59:55Z)
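As referenced in the MobileViTv2 entry above, separable self-attention replaces the k x k attention map with per-token context scores, which is where the $O(k)$ complexity comes from. The sketch below is a reconstruction of that general idea under stated assumptions (the layer names, the ReLU gating, and the single-head form are mine), not the paper's reference implementation.

```python
# Hedged sketch of linear-complexity ("separable") self-attention in the spirit
# of the MobileViTv2 entry above; a reconstruction, not the paper's code.
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_score = nn.Linear(dim, 1)   # one scalar score per token -> O(k), not O(k^2)
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, k, d)
        scores = self.to_score(x).softmax(dim=1)          # (B, k, 1) context scores
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, d) global context
        out = torch.relu(self.to_value(x)) * context      # broadcast element-wise mixing
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 256, 64)                  # k = 256 tokens, d = 64 channels
    print(SeparableSelfAttention(64)(x).shape)   # torch.Size([2, 256, 64])
```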
This list is automatically generated from the titles and abstracts of the papers on this site.