EfficientFormer: Vision Transformers at MobileNet Speed
- URL: http://arxiv.org/abs/2206.01191v1
- Date: Thu, 2 Jun 2022 17:51:03 GMT
- Title: EfficientFormer: Vision Transformers at MobileNet Speed
- Authors: Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey
Tulyakov, Yanzhi Wang, Jian Ren
- Abstract summary: Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory.
- Score: 43.93223983817965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViT) have shown rapid progress in computer vision tasks,
achieving promising results on various benchmarks. However, due to the massive
number of parameters and model design, e.g., the attention mechanism, ViT-based
models are generally several times slower than lightweight convolutional networks.
Therefore, the deployment of ViT for real-time applications is particularly
challenging, especially on resource-constrained hardware such as mobile
devices. Recent efforts try to reduce the computational complexity of ViT through
network architecture search or hybrid design with MobileNet blocks, yet the
inference speed is still unsatisfactory. This leads to an important question:
can transformers run as fast as MobileNet while obtaining high performance? To
answer this, we first revisit the network architecture and operators used in
ViT-based models and identify inefficient designs. Then we introduce a
dimension-consistent pure transformer (without MobileNet blocks) as a design
paradigm. Finally, we perform latency-driven slimming to get a series of final
models dubbed EfficientFormer. Extensive experiments show the superiority of
EfficientFormer in performance and speed on mobile devices. Our fastest model,
EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6
ms inference latency on iPhone 12 (compiled with CoreML), which is even a bit
faster than MobileNetV2 (1.7 ms, 71.8% top-1), and our largest model,
EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work
proves that properly designed transformers can reach extremely low latency on
mobile devices while maintaining high performance.
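For a concrete picture of the dimension-consistent idea described above, below is a minimal PyTorch-style sketch, assuming two illustrative block types (MetaBlock4D and MetaBlock3D, names invented here): early stages operate purely on 4D feature maps with a pooling token mixer and a 1x1-convolution MLP, while later stages operate purely on 3D token sequences with multi-head self-attention, so tensors are reshaped only once. Layer choices, dimensions, and the single flatten step are assumptions for illustration, not the exact EfficientFormer configuration or its latency-driven slimming procedure.

```python
import torch
import torch.nn as nn


class MetaBlock4D(nn.Module):
    """Conv-stage block on 4D feature maps: pooling token mixer + 1x1-conv MLP (illustrative)."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.token_mixer = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.BatchNorm2d(dim * mlp_ratio),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.token_mixer(x)            # spatial mixing without attention
        x = x + self.mlp(x)                    # channel mixing
        return x


class MetaBlock3D(nn.Module):
    """Transformer block on 3D token sequences: MHSA + linear MLP (illustrative)."""

    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


# Toy usage: conv-style 4D blocks first, a single flatten, then 3D attention blocks.
feat = torch.randn(1, 96, 14, 14)              # (B, C, H, W)
feat = MetaBlock4D(96)(feat)
tokens = feat.flatten(2).transpose(1, 2)       # one reshape to (B, H*W, C)
tokens = MetaBlock3D(96)(tokens)               # (1, 196, 96)
```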
Related papers
- FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency and low-frequency features, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of the latency/accuracy trade-off across various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z)
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from the ViT perspective and emphasize their promising prospects for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications [7.2210216531805695]
Vision graph neural networks (ViGs) provide a new avenue for exploration.
However, ViGs are computationally expensive due to the overhead of representing images as graph structures.
We propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices.
arXiv Detail & Related papers (2023-07-01T17:49:12Z)
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications (see the illustrative sketch after this list).
We build a series of models called "SwiftFormer" that achieve state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, making it more accurate and 2x faster than MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
- FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance among efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
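The SwiftFormer entry above describes replacing the quadratic matrix multiplications of self-attention with linear element-wise operations. Below is a rough, hedged sketch of one such linear-complexity additive attention in PyTorch; the module name, projections, scaling, and residual are illustrative assumptions, not SwiftFormer's published formulation.

```python
import math

import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Linear-complexity attention sketch: a pooled global query interacts with keys element-wise."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1, bias=False)   # learned scoring vector for queries
        self.proj = nn.Linear(dim, dim)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x):                            # x: (B, N, D)
        q = self.to_q(x)                             # (B, N, D)
        k = self.to_k(x)                             # (B, N, D)
        alpha = torch.softmax(self.score(q) * self.scale, dim=1)  # (B, N, 1), weights over tokens
        global_q = (alpha * q).sum(dim=1, keepdim=True)           # (B, 1, D) pooled global query
        context = global_q * k                       # element-wise; no N x N attention matrix
        return x + self.proj(context)                # residual output, shape (B, N, D)


tokens = torch.randn(2, 196, 64)
out = AdditiveAttention(64)(tokens)                  # (2, 196, 64)
```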