MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision
Transformer
- URL: http://arxiv.org/abs/2110.02178v1
- Date: Tue, 5 Oct 2021 17:07:53 GMT
- Title: MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision
Transformer
- Authors: Sachin Mehta and Mohammad Rastegari
- Abstract summary: We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
- Score: 24.47196590256829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Light-weight convolutional neural networks (CNNs) are the de-facto choice for mobile
vision tasks. Their spatial inductive biases allow them to learn
representations with fewer parameters across different vision tasks. However,
these networks are spatially local. To learn global representations,
self-attention-based vision transformers (ViTs) have been adopted. Unlike
CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is
it possible to combine the strengths of CNNs and ViTs to build a light-weight
and low-latency network for mobile vision tasks? Towards this end, we introduce
MobileViT, a light-weight and general-purpose vision transformer for mobile
devices. MobileViT presents a different perspective for the global processing
of information with transformers, i.e., transformers as convolutions. Our
results show that MobileViT significantly outperforms CNN- and ViT-based
networks across different tasks and datasets. On the ImageNet-1k dataset,
MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters,
which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT
(ViT-based) for a similar number of parameters. On the MS-COCO object detection
task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of
parameters.
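To make the "transformers as convolutions" idea above concrete, the sketch below shows one way a MobileViT-style block can be put together in PyTorch: convolutions encode local detail, the feature map is unfolded into patches, a transformer attends across patches for global context, and the result is folded back and fused with the input. Channel widths, patch size, depth, and activation choices are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch of a MobileViT-style block ("transformers as convolutions").
# Sizes and layer choices are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    def __init__(self, c_in=64, d=96, patch=2, depth=2, heads=4):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv for spatial detail, 1x1 conv to project to d.
        self.local = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, d, 1),
        )
        # Global representation: a small transformer encoder applied over patches.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2 * d, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Project back and fuse with the residual input, as a convolution block would.
        self.proj = nn.Conv2d(d, c_in, 1)
        self.fuse = nn.Conv2d(2 * c_in, c_in, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        y = self.local(x)                                  # (B, d, H, W)
        d = y.shape[1]
        # Unfold into p*p pixel groups; attention runs across the (H/p * W/p) patches
        # for each pixel position, which is what gives the block a global view.
        y = y.reshape(b, d, h // p, p, w // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), d)
        y = self.transformer(y)
        # Fold the patch sequence back into a feature map.
        y = y.reshape(b, p, p, h // p, w // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(b, d, h, w)
        y = self.proj(y)
        return self.fuse(torch.cat([x, y], dim=1))         # combine local and global

x = torch.randn(1, 64, 32, 32)
print(MobileViTBlockSketch()(x).shape)  # torch.Size([1, 64, 32, 32])
```

Because attention is computed over patches rather than every pixel, the quadratic attention cost shrinks by roughly a factor of p² while the block keeps a global receptive field.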
Related papers
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) and make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
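As a rough illustration of the per-image routing described in the Mobile V-MoEs summary above, the sketch below pools an image's tokens, scores a small set of experts with a linear router, and sends the whole image through its top-k experts. The router, expert MLPs, and dimensions are assumptions for illustration, not the paper's design.

```python
# Illustrative per-image (rather than per-patch) MoE routing.
# Router, experts, and sizes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerImageMoESketch(nn.Module):
    def __init__(self, dim=192, num_experts=4, k=2):
        super().__init__()
        self.k = k
        # One routing decision per image, scored from a pooled representation.
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens):                          # tokens: (B, N, dim)
        pooled = tokens.mean(dim=1)                     # (B, dim): whole-image summary
        gates = F.softmax(self.router(pooled), dim=-1)  # (B, num_experts)
        topv, topi = gates.topk(self.k, dim=-1)         # top-k experts per image
        out = torch.zeros_like(tokens)
        for b in range(tokens.shape[0]):
            # Every token of image b is processed by the same top-k experts.
            for v, i in zip(topv[b], topi[b]):
                out[b] = out[b] + v * self.experts[int(i)](tokens[b])
        return out

x = torch.randn(2, 49, 192)
print(PerImageMoESketch()(x).shape)  # torch.Size([2, 49, 192])
```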
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from the ViT perspective and emphasize their promising prospects for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications [7.2210216531805695]
Vision graph neural networks (ViGs) provide a new avenue for exploration.
ViGs are computationally expensive due to the overhead of representing images as graph structures.
We propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices.
arXiv Detail & Related papers (2023-07-01T17:49:12Z)
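The MobileViG summary above describes a static, graph-based sparse attention (SVGA) built for mobile devices. The sketch below illustrates one such static pattern: each token aggregates features from tokens at fixed strides along its row and column, so no per-image KNN graph has to be built. The max-relative aggregation, stride, and channel width are illustrative assumptions rather than the paper's exact operator.

```python
# Illustrative static, strided sparse graph aggregation (SVGA-like pattern).
# The stride, aggregation rule, and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class StridedGraphAggregationSketch(nn.Module):
    def __init__(self, dim=64, stride=4):
        super().__init__()
        self.stride = stride
        self.update = nn.Conv2d(2 * dim, dim, 1)   # mixes node and aggregated features

    def forward(self, x):                          # x: (B, C, H, W)
        k = self.stride
        agg = torch.full_like(x, float("-inf"))
        # Neighbors at fixed offsets along the column and row (a static graph):
        for shift in range(k, x.shape[2], k):
            agg = torch.maximum(agg, torch.roll(x, shift, dims=2) - x)   # column hops
        for shift in range(k, x.shape[3], k):
            agg = torch.maximum(agg, torch.roll(x, shift, dims=3) - x)   # row hops
        # Max-relative aggregation over the static neighbors, then a pointwise update.
        return self.update(torch.cat([x, agg], dim=1))

x = torch.randn(1, 64, 16, 16)
print(StridedGraphAggregationSketch()(x).shape)  # torch.Size([1, 64, 16, 16])
```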
- Vision Transformers for Mobile Applications: A Short Survey [0.0]
Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many computer vision tasks.
However, deploying large-scale ViTs is resource-intensive and infeasible on many mobile devices.
We look into a few ViTs specifically designed for mobile applications and observe that they either modify the transformer's architecture or are built around a combination of CNNs and transformers.
arXiv Detail & Related papers (2023-05-30T19:12:08Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to bring down the cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z)
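The XFormer summary above only states that cross feature attention (XFA) lowers the transformer's cost. As an illustrative stand-in, the sketch below computes attention over the feature dimension (cross-covariance style), whose cost grows linearly with the number of tokens; this is an assumption for illustration and not necessarily the paper's exact XFA formulation.

```python
# Illustrative channel (cross-covariance style) attention as a cheaper alternative
# to token-to-token attention; a stand-in, not the paper's exact XFA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionSketch(nn.Module):
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.temp = nn.Parameter(torch.ones(heads, 1, 1))  # learned temperature

    def forward(self, x):                                  # x: (B, N, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.qkv(x).reshape(b, n, 3, h, d // h).permute(2, 0, 3, 4, 1)
        # q, k, v: (B, heads, d_head, N); the attention map is (d_head x d_head),
        # so cost grows linearly in N rather than quadratically.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temp       # (B, heads, d_head, d_head)
        out = attn.softmax(dim=-1) @ v                     # (B, heads, d_head, N)
        out = out.permute(0, 3, 1, 2).reshape(b, n, d)
        return self.proj(out)

x = torch.randn(2, 196, 96)
print(ChannelAttentionSketch()(x).shape)  # torch.Size([2, 196, 96])
```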
- MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer blocks into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z)
- Vision Transformer Adapter for Dense Predictions [57.590511173416445]
Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-related prior information.
We propose a Vision Transformer Adapter (ViT-Adapter) which can remedy the defects of ViT and achieve comparable performance to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2022-05-17T17:59:11Z)
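As a rough sketch of the adapter idea in the ViT-Adapter summary above, the code below lets convolutional features supply the image priors a plain ViT lacks and injects them into the ViT tokens through cross-attention. The conv stem, the single injection step, and all sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative spatial-prior injection into ViT tokens via cross-attention.
# The stem, single injector, and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class SpatialPriorInjectorSketch(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        # A small conv stem acts as the spatial prior module.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 2, 3, stride=4, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=4, padding=1),
        )
        # ViT tokens query the conv features via cross-attention.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))   # start as an identity injection

    def forward(self, vit_tokens, image):             # tokens: (B, N, dim), image: (B, 3, H, W)
        prior = self.stem(image).flatten(2).transpose(1, 2)      # (B, M, dim)
        injected, _ = self.cross_attn(vit_tokens, prior, prior)
        return vit_tokens + self.gamma * injected     # priors mixed into ViT tokens

tokens = torch.randn(1, 196, 192)
image = torch.randn(1, 3, 224, 224)
print(SpatialPriorInjectorSketch()(tokens, image).shape)  # torch.Size([1, 196, 192])
```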
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [111.8342799044698]
We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
arXiv Detail & Related papers (2022-04-12T04:51:42Z)
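To illustrate the token-pyramid idea in the TopFormer summary above, the sketch below pools feature maps from several scales to a shared small grid, concatenates them into tokens for a transformer, and injects the resulting scale-aware semantics back into each scale. Channel widths, the pooled resolution, and the additive injection are illustrative assumptions rather than the paper's exact design.

```python
# Illustrative token-pyramid processing: pool multi-scale features, run a transformer,
# and inject the semantics back per scale. Sizes and injection are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPyramidSketch(nn.Module):
    def __init__(self, channels=(32, 64, 128), pooled=7, depth=2, heads=4):
        super().__init__()
        self.channels = channels
        self.pooled = pooled
        d = sum(channels)                                   # concatenated token width
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feats):                               # list of (B, C_i, H_i, W_i)
        b, s = feats[0].shape[0], self.pooled
        # Pool every scale to the same small grid and concatenate along channels.
        pooled = [F.adaptive_avg_pool2d(f, s) for f in feats]
        tokens = torch.cat(pooled, dim=1).flatten(2).transpose(1, 2)  # (B, s*s, sum C_i)
        tokens = self.encoder(tokens)                       # scale-aware semantics
        sem = tokens.transpose(1, 2).reshape(b, -1, s, s)
        # Split the semantics per scale and inject them back into the matching features.
        outs, start = [], 0
        for f, c in zip(feats, self.channels):
            part = sem[:, start:start + c]
            outs.append(f + F.interpolate(part, size=f.shape[-2:], mode="bilinear",
                                          align_corners=False))
            start += c
        return outs

feats = [torch.randn(1, 32, 56, 56), torch.randn(1, 64, 28, 28), torch.randn(1, 128, 14, 14)]
print([o.shape for o in TokenPyramidSketch()(feats)])
```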
This list is automatically generated from the titles and abstracts of the papers on this site.