RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization
- URL: http://arxiv.org/abs/2406.16004v2
- Date: Sat, 20 Jul 2024 03:45:15 GMT
- Title: RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization
- Authors: Mingshu Zhao, Yi Luo, Yong Ouyang
- Abstract summary: Lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are favored for their parameter efficiency and low latency.
This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications.
- Score: 8.346566205092433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the realm of resource-constrained mobile vision tasks, the pursuit of efficiency and performance consistently drives innovation in lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). While ViTs excel at capturing global context through self-attention mechanisms, their deployment in resource-limited environments is hindered by computational complexity and latency. Conversely, lightweight CNNs are favored for their parameter efficiency and low latency. This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications. We introduce RepNeXt, a novel model series that integrates multi-scale feature representations and incorporates both serial and parallel structural reparameterization (SRP) to enhance network depth and width without compromising inference speed. Extensive experiments demonstrate RepNeXt's superiority over current leading lightweight CNNs and ViTs, providing advantageous latency across various vision benchmarks. RepNeXt-M4 matches RepViT-M1.5's 82.3% accuracy on ImageNet within 1.5ms on an iPhone 12, outperforms its AP$^{box}$ by 1.3 on MS-COCO, and reduces parameters by 0.7M. Code and models are available at https://github.com/suous/RepNeXt.
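The abstract mentions parallel structural reparameterization without showing what it involves. The following is a minimal sketch of the general idea, not RepNeXt's actual code: it assumes a single-channel input, one 3x3 branch, and one 1x1 branch, and the helper names `conv2d_same` and `merge_parallel` are hypothetical. Because convolution is linear, the sum of the two parallel branches equals one convolution with a merged kernel, so the extra branch adds no inference cost.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (output keeps input size)."""
    k = len(kernel)  # assumes an odd, square kernel
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def merge_parallel(k3, k1):
    """Fold a 1x1 branch into a 3x3 kernel by adding its weight at the centre tap."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged


# Illustrative values only; these are not weights from any trained model.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
k1 = [[2.0]]

# Training-time view: two parallel branches whose outputs are summed.
branch3 = conv2d_same(image, k3)
branch1 = conv2d_same(image, k1)
two_branch = [[branch3[i][j] + branch1[i][j] for j in range(3)] for i in range(3)]

# Inference-time view: one convolution with the merged kernel.
one_branch = conv2d_same(image, merge_parallel(k3, k1))

max_diff = max(
    abs(two_branch[i][j] - one_branch[i][j]) for i in range(3) for j in range(3)
)
print(max_diff)
```

In a real network the same folding would be applied per channel, and any normalization layers in the branches would need to be fused into the kernels first; both steps are omitted here.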
Related papers
- CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction [14.377544481394013]
CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features.
This integration enables efficient processing of detailed local and broader contextual information.
Experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance.
arXiv Detail & Related papers (2024-10-15T09:27:26Z)
- FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from a ViT perspective and emphasize their promising prospects for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN [34.020978009518245]
We propose a lightweight real-time semantic segmentation network called LETNet.
LETNet effectively combines a U-shaped CNN with a Transformer in a capsule embedding style to compensate for their respective deficiencies.
Experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency.
arXiv Detail & Related papers (2023-02-21T07:16:53Z)
- Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to reduce the cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)