Mobile-Former: Bridging MobileNet and Transformer
- URL: http://arxiv.org/abs/2108.05895v1
- Date: Thu, 12 Aug 2021 17:59:55 GMT
- Title: Mobile-Former: Bridging MobileNet and Transformer
- Authors: Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu
- Abstract summary: We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between.
Mobile-Former is not only computationally efficient, but also has more representation power, outperforming MobileNetV3 in the low-FLOP regime.
- Score: 42.60008028063716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Mobile-Former, a parallel design of MobileNet and Transformer with
a two-way bridge in between. This structure leverages the advantage of
MobileNet at local processing and transformer at global interaction. The
bridge enables bidirectional fusion of local and global features. Unlike
recent works on vision transformers, the transformer in Mobile-Former
contains very few tokens (e.g., fewer than 6 tokens) that are randomly
initialized, resulting in low computational cost. Combined with the proposed
light-weight cross attention to model the bridge, Mobile-Former is not only
computationally efficient, but also has more representation power,
outperforming MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet
classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs,
gaining 1.3% over MobileNetV3 while saving 17% of computations. When
transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6
AP.
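The two-way bridge described in the abstract can be illustrated with a rough sketch. The abstract does not specify projections, head counts, or normalization, so the code below is only a generic cross attention with no learned weights, written in plain Python. All names (`cross_attention`, `two_way_bridge`) and sizes (4 tokens, 49 positions, 8 channels) are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query attends over keys_values and adds the result residually.

    Assumption: keys double as values and no learned projections are used,
    which keeps the sketch small; a real model would learn projections.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def two_way_bridge(tokens, local_features):
    """Fuse local features into global tokens, then tokens back into features."""
    new_tokens = cross_attention(tokens, local_features)    # local -> global
    new_local = cross_attention(local_features, new_tokens)  # global -> local
    return new_tokens, new_local


random.seed(0)
num_tokens, num_positions, dim = 4, 49, 8  # illustrative sizes only
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
local_features = [
    [random.gauss(0, 1) for _ in range(dim)] for _ in range(num_positions)
]
new_tokens, new_local = two_way_bridge(tokens, local_features)

# Similarity scores computed by the bridge vs. full self-attention over positions.
pairs_bridge = 2 * num_tokens * num_positions
pairs_full = num_positions ** 2
print(pairs_bridge, pairs_full)
```

With only a handful of tokens, the number of query-key pairs in this sketch grows linearly with the number of spatial positions, whereas self-attention over all positions grows quadratically. This is one way to read the abstract's claim that few tokens lead to low computational cost.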
Related papers
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) and make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
- MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12.
We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
- EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z)
- MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformers into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z)
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [111.8342799044698]
We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
arXiv Detail & Related papers (2022-04-12T04:51:42Z)
- EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet based backbone model.
We combine the global circular convolution (GCC) with position embeddings, a light-weight convolution op.
Experimental results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z)
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.