Rethinking Vision Transformers for MobileNet Size and Speed
- URL: http://arxiv.org/abs/2212.08059v2
- Date: Mon, 4 Sep 2023 12:47:28 GMT
- Title: Rethinking Vision Transformers for MobileNet Size and Speed
- Authors: Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi
Wang, Sergey Tulyakov, Jian Ren
- Abstract summary: We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
- Score: 58.01406896628446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the success of Vision Transformers (ViTs) in computer vision tasks,
recent works try to optimize the performance and complexity of ViTs to enable
efficient deployment on mobile devices. Multiple approaches have been proposed to
accelerate the attention mechanism, improve inefficient designs, or incorporate
mobile-friendly lightweight convolutions to form hybrid architectures. However,
ViT and its variants still have higher latency or considerably more parameters
than lightweight CNNs, even compared with the years-old MobileNet. In practice,
both latency and size are crucial for efficient deployment on
resource-constrained hardware. In this work, we investigate a central question:
can transformer models run as fast as MobileNet and maintain a similar size? We
revisit the design choices of ViTs and propose a novel supernet with low
latency and high parameter efficiency. We further introduce a novel
fine-grained joint search strategy for transformer models that can find
efficient architectures by optimizing latency and number of parameters
simultaneously. The proposed models, EfficientFormerV2, achieve 3.5% higher
top-1 accuracy than MobileNetV2 on ImageNet-1K with similar latency and
parameters. This work demonstrates that properly designed and optimized vision
transformers can achieve high performance even with MobileNet-level size and
speed.
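The abstract describes a fine-grained joint search that optimizes latency and parameter count simultaneously rather than trading one off after the fact. As a rough illustration only, not the authors' implementation, the Python sketch below scores a candidate sub-network with a single objective that rewards accuracy and penalizes exceeding hypothetical latency and parameter budgets; the function name, budget targets, and weights are all assumptions.

```python
# Minimal sketch of a joint latency/parameter-aware search objective
# (illustrative only; the function name, budgets, and weights are
# hypothetical, not taken from the EfficientFormerV2 paper).

def joint_search_score(accuracy: float,
                       latency_ms: float,
                       params_m: float,
                       target_latency_ms: float = 1.5,
                       target_params_m: float = 3.5,
                       alpha: float = 10.0,
                       beta: float = 10.0) -> float:
    """Higher is better: reward top-1 accuracy, penalize candidates
    that exceed MobileNet-level latency and parameter budgets."""
    latency_penalty = max(0.0, latency_ms / target_latency_ms - 1.0)
    param_penalty = max(0.0, params_m / target_params_m - 1.0)
    return accuracy - alpha * latency_penalty - beta * param_penalty


# Example: a candidate at 72.5% top-1, 1.6 ms on-device latency, 3.4 M params.
print(joint_search_score(72.5, latency_ms=1.6, params_m=3.4))
```

In the actual search, such a score would be evaluated for sub-networks sampled from the supernet; the sketch only illustrates the idea of coupling the latency and parameter constraints in a single objective.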
Related papers
- Efficient Modulation for Vision Networks [122.1051910402034]
We propose efficient modulation, a novel design for efficient vision networks.
We demonstrate that the modulation mechanism is particularly well suited for efficient networks.
Our network can accomplish better trade-offs between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-29T03:48:35Z) - FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z) - TurboViT: Generating Fast Vision Transformers via Generative
Architecture Search [74.24393546346974]
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years.
There has recently been significant research on the design of efficient vision transformer architectures.
In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search.
arXiv Detail & Related papers (2023-08-22T13:08:29Z) - RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from the ViT perspective and emphasize their promising prospects for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z) - FastViT: A Fast Hybrid Vision Transformer using Structural
Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z) - EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z) - MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present the Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer blocks into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision
Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [25.63398340113755]
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
We introduce the attention bias, a new way to integrate positional information in vision transformers (a rough sketch of the idea appears after this list).
Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff.
arXiv Detail & Related papers (2021-04-02T16:29:57Z)
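The LeViT entry above mentions an attention bias as a way to integrate positional information. The PyTorch sketch below illustrates the general idea, a learned per-head bias indexed by the relative offset between query and key positions and added to the attention logits; the class name, the use of absolute offsets, and all shapes are illustrative assumptions rather than the authors' code.

```python
# Illustrative sketch of a learned attention bias for positional
# information (assumptions throughout; not the LeViT reference code).
import torch
import torch.nn as nn


class AttentionWithBias(nn.Module):
    def __init__(self, dim: int, num_heads: int, resolution: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head and per distinct relative offset
        # between two positions on a resolution x resolution token grid.
        points = [(i, j) for i in range(resolution) for j in range(resolution)]
        offsets, idxs = {}, []
        for p in points:
            for q in points:
                off = (abs(p[0] - q[0]), abs(p[1] - q[1]))
                idxs.append(offsets.setdefault(off, len(offsets)))
        self.bias_table = nn.Parameter(torch.zeros(num_heads, len(offsets)))
        self.register_buffer(
            "bias_idx", torch.tensor(idxs).view(len(points), len(points))
        )

    def forward(self, x):  # x: (batch, tokens, dim), tokens == resolution**2
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.bias_table[:, self.bias_idx]  # learned positional bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Example usage on a 7x7 token grid with 64-dim tokens and 4 heads.
x = torch.randn(1, 49, 64)
layer = AttentionWithBias(dim=64, num_heads=4, resolution=7)
print(layer(x).shape)  # torch.Size([1, 49, 64])
```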