EfficientFormer: Vision Transformers at MobileNet Speed
- URL: http://arxiv.org/abs/2206.01191v1
- Date: Thu, 2 Jun 2022 17:51:03 GMT
- Title: EfficientFormer: Vision Transformers at MobileNet Speed
- Authors: Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey
Tulyakov, Yanzhi Wang, Jian Ren
- Abstract summary: Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory.
- Score: 43.93223983817965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViT) have shown rapid progress in computer vision tasks,
achieving promising results on various benchmarks. However, due to the massive
number of parameters and model design, e.g., the attention mechanism, ViT-based
models are generally several times slower than lightweight convolutional networks.
Therefore, the deployment of ViT for real-time applications is particularly
challenging, especially on resource-constrained hardware such as mobile
devices. Recent efforts try to reduce the computational complexity of ViT through
network architecture search or hybrid design with MobileNet blocks, yet the
inference speed is still unsatisfactory. This leads to an important question:
can transformers run as fast as MobileNet while obtaining high performance? To
answer this, we first revisit the network architecture and operators used in
ViT-based models and identify inefficient designs. Then we introduce a
dimension-consistent pure transformer (without MobileNet blocks) as a design
paradigm. Finally, we perform latency-driven slimming to get a series of final
models dubbed EfficientFormer. Extensive experiments show the superiority of
EfficientFormer in performance and speed on mobile devices. Our fastest model,
EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6
ms inference latency on iPhone 12 (compiled with CoreML), which is even a bit
faster than MobileNetV2 (1.7 ms, 71.8% top-1), and our largest model,
EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work
proves that properly designed transformers can reach extremely low latency on
mobile devices while maintaining high performance.
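For a concrete picture of the dimension-consistent idea described above, below is a minimal PyTorch-style sketch, assuming two illustrative block types (MetaBlock4D and MetaBlock3D, names invented here): early stages operate purely on 4D feature maps with a pooling token mixer and a 1x1-convolution MLP, while later stages operate purely on 3D token sequences with multi-head self-attention, so tensors are reshaped only once. Layer choices, dimensions, and the single flatten step are assumptions for illustration, not the exact EfficientFormer configuration or its latency-driven slimming procedure.

```python
import torch
import torch.nn as nn


class MetaBlock4D(nn.Module):
    """Conv-stage block on 4D feature maps: pooling token mixer + 1x1-conv MLP (illustrative)."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.token_mixer = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.BatchNorm2d(dim * mlp_ratio),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.token_mixer(x)            # spatial mixing without attention
        x = x + self.mlp(x)                    # channel mixing
        return x


class MetaBlock3D(nn.Module):
    """Transformer block on 3D token sequences: MHSA + linear MLP (illustrative)."""

    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


# Toy usage: conv-style 4D blocks first, a single flatten, then 3D attention blocks.
feat = torch.randn(1, 96, 14, 14)              # (B, C, H, W)
feat = MetaBlock4D(96)(feat)
tokens = feat.flatten(2).transpose(1, 2)       # one reshape to (B, H*W, C)
tokens = MetaBlock3D(96)(tokens)               # (1, 196, 96)
```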
Related papers
- FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency and low-frequency features, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of the latency/accuracy trade-off across various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z)
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from the ViT perspective and emphasize their promising prospects for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications [7.2210216531805695]
Vision graph neural networks (ViGs) provide a new avenue for exploration.
However, ViGs are computationally expensive due to the overhead of representing images as graph structures.
We propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices.
arXiv Detail & Related papers (2023-07-01T17:49:12Z)
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications (see the illustrative sketch after this list).
We build a series of models called "SwiftFormer" that achieve state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, making it more accurate and 2x faster than MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
- FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance among efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
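The SwiftFormer entry above describes replacing the quadratic matrix multiplications of self-attention with linear element-wise operations. Below is a rough, hedged sketch of one such linear-complexity additive attention in PyTorch; the module name, projections, scaling, and residual are illustrative assumptions, not SwiftFormer's published formulation.

```python
import math

import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Linear-complexity attention sketch: a pooled global query interacts with keys element-wise."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1, bias=False)   # learned scoring vector for queries
        self.proj = nn.Linear(dim, dim)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x):                            # x: (B, N, D)
        q = self.to_q(x)                             # (B, N, D)
        k = self.to_k(x)                             # (B, N, D)
        alpha = torch.softmax(self.score(q) * self.scale, dim=1)  # (B, N, 1), weights over tokens
        global_q = (alpha * q).sum(dim=1, keepdim=True)           # (B, 1, D) pooled global query
        context = global_q * k                       # element-wise; no N x N attention matrix
        return x + self.proj(context)                # residual output, shape (B, N, D)


tokens = torch.randn(2, 196, 64)
out = AdditiveAttention(64)(tokens)                  # (2, 196, 64)
```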