RepViT: Revisiting Mobile CNN From ViT Perspective
- URL: http://arxiv.org/abs/2307.09283v8
- Date: Thu, 14 Mar 2024 08:28:13 GMT
- Title: RepViT: Revisiting Mobile CNN From ViT Perspective
- Authors: Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
- Abstract summary: Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices.
- Score: 67.05569159984691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However, the notable architectural disparities in the block structure, macro, and micro designs between them have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Besides, when RepViT meets SAM, our RepViT-SAM can achieve nearly 10× faster inference than the advanced MobileSAM. Codes and models are available at https://github.com/THU-MIG/RepViT.
Related papers
- Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition [49.14350399025926]
We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition.
Middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images.
arXiv Detail & Related papers (2024-07-28T11:52:36Z)
- RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization [8.346566205092433]
Lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are favored for their parameter efficiency and low latency.
This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications.
arXiv Detail & Related papers (2024-06-23T04:11:12Z)
- RepViT-SAM: Towards Real-Time Segmenting Anything [71.94042743317937]
Segment Anything Model (SAM) has shown impressive zero-shot transfer performance for various computer vision tasks.
MobileSAM proposes to replace the heavyweight image encoder in SAM with TinyViT by employing distillation.
RepViT-SAM can enjoy significantly better zero-shot transfer capability than MobileSAM, along with nearly 10× faster inference speed.
arXiv Detail & Related papers (2023-12-10T04:42:56Z)
- MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications [7.2210216531805695]
Vision graph neural networks (ViGs) provide a new avenue for exploration.
ViGs are computationally expensive due to the overhead of representing images as graph structures.
We propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices.
arXiv Detail & Related papers (2023-07-01T17:49:12Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally multiple times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.