MobileViTv3: Mobile-Friendly Vision Transformer with Simple and
Effective Fusion of Local, Global and Input Features
- URL: http://arxiv.org/abs/2209.15159v1
- Date: Fri, 30 Sep 2022 01:04:10 GMT
- Title: MobileViTv3: Mobile-Friendly Vision Transformer with Simple and
Effective Fusion of Local, Global and Input Features
- Authors: Shakti N. Wadekar and Abhishek Chaurasia
- Abstract summary: MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks.
We propose simple and effective changes to the fusion block to create the MobileViTv3-block.
The MobileViTv3-XXS, XS and S models built with the proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and
vision transformers (ViTs) to create light-weight models for mobile vision
tasks. Though the main MobileViTv1-block helps to achieve competitive
state-of-the-art results, the fusion block inside the MobileViTv1-block
creates scaling challenges and poses a complex learning task. We propose
simple and effective changes to the fusion block to create the
MobileViTv3-block, which addresses the scaling challenges and simplifies the
learning task. The MobileViTv3-XXS, XS and S models built with the proposed
MobileViTv3-block outperform MobileViTv1 on the ImageNet-1k, ADE20K, COCO and
PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS
surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively. The
recently published MobileViTv2 architecture removes the fusion block and uses
linear-complexity transformers to perform better than MobileViTv1. We add our
proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and
1.0 models, which achieve higher accuracy than MobileViTv2 on the ImageNet-1k,
ADE20K, COCO and PascalVOC2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75
outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0%,
respectively, on ImageNet-1K. For the segmentation task, MobileViTv3-1.0
achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and
PascalVOC2012 datasets, respectively. Our code and the trained models are
available at: https://github.com/micronDLA/MobileViTv3
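As a rough illustration of the fusion described above, here is a minimal PyTorch-style sketch that combines local, global and input feature maps with a 1x1 convolution plus an input residual. The layer choices and equal channel counts are assumptions made for brevity; the authors' actual MobileViTv3-block is in the linked repository.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative fusion of local, global and input features.

    This only mirrors the abstract's high-level description; it is not the
    reference MobileViTv3-block (which also defines how the local and global
    branches themselves are computed).
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution over the concatenated local and global features
        # (an assumption chosen to keep the fusion cheap and scale-friendly).
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x_input, x_local, x_global):
        # Fuse the local (conv) and global (transformer) representations ...
        fused = self.fuse(torch.cat([x_local, x_global], dim=1))
        # ... then add the block input back as a residual, so all three
        # streams (input, local, global) contribute to the output.
        return fused + x_input

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)            # block input
    local_feat = torch.randn(1, 64, 32, 32)   # from the local conv branch
    global_feat = torch.randn(1, 64, 32, 32)  # from the transformer branch
    print(FusionSketch(64)(x, local_feat, global_feat).shape)  # (1, 64, 32, 32)
```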
Related papers
- Scaling Graph Convolutions for Mobile Vision [6.4399181389092]
This paper introduces Mobile Graph Convolution (MGC), a new vision graph neural network (ViG) module that addresses the scaling limitations of prior ViG designs.
Our proposed mobile vision architecture, MobileViGv2, uses MGC to demonstrate the effectiveness of our approach.
Our largest model, MobileViGv2-B, achieves an 83.4% top-1 accuracy, 0.8% higher than MobileViG-B, with 2.7 ms inference latency.
arXiv Detail & Related papers (2024-06-09T16:49:19Z)
- Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding [81.1943823985213]
In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT).
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
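The distillation objective itself is not spelled out in this summary; purely as a generic illustration of compressing a large teacher such as GMViT into a smaller student, the sketch below shows the standard soft-label knowledge-distillation loss. The temperature and loss weighting are arbitrary assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Generic knowledge-distillation objective (illustrative, not GMViT-specific)."""
    # Soft targets: KL divergence between tempered teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```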
arXiv Detail & Related papers (2023-12-27T08:52:41Z)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
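A toy sketch of that routing idea, assuming a pooled image descriptor and top-1 expert selection (both illustrative choices, not the paper's implementation): every patch of an image is processed by the single expert chosen for that image, rather than routing each patch separately.

```python
import torch
import torch.nn as nn

class PerImageMoESketch(nn.Module):
    """Illustrative per-image mixture-of-experts routing."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)        # scores one expert per image
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, tokens):                           # tokens: (batch, patches, dim)
        pooled = tokens.mean(dim=1)                      # one descriptor per image
        expert_idx = self.router(pooled).argmax(dim=-1)  # top-1 expert per image
        # All patches of image i go through the same expert.
        outs = [self.experts[idx.item()](tokens[i]) for i, idx in enumerate(expert_idx)]
        return torch.stack(outs, dim=0)

if __name__ == "__main__":
    print(PerImageMoESketch(dim=32)(torch.randn(2, 49, 32)).shape)  # (2, 49, 32)
```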
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from a ViT perspective and emphasize their promising prospects for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications [7.2210216531805695]
Vision graph neural networks (ViGs) provide a new avenue for exploration, but they are computationally expensive due to the overhead of representing images as graph structures.
We propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices.
arXiv Detail & Related papers (2023-07-01T17:49:12Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$.
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
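For intuition, the sketch below follows the general shape of such a linear-complexity attention: a single context score per token replaces the $O(k^2)$ map of token-pair scores, so the cost grows linearly with the number of tokens $k$. Layer names and the omitted normalization details are assumptions, not MobileViTv2's reference code.

```python
import torch
import torch.nn as nn

class SeparableAttentionSketch(nn.Module):
    """Illustrative O(k) attention in the spirit of separable self-attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one scalar context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (batch, k, dim)
        scores = torch.softmax(self.to_scores(x), dim=1)   # (batch, k, 1), O(k)
        # Collapse all tokens into a single context vector.
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (batch, 1, dim)
        # Broadcast the context back to every token and project out.
        return self.out(torch.relu(self.to_value(x)) * context)

if __name__ == "__main__":
    print(SeparableAttentionSketch(dim=64)(torch.randn(2, 196, 64)).shape)  # (2, 196, 64)
```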
arXiv Detail & Related papers (2022-06-06T15:31:35Z)
- MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present the Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer architecture into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z)
- Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even improving performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.