EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications
- URL: http://arxiv.org/abs/2206.10589v1
- Date: Tue, 21 Jun 2022 17:59:56 GMT
- Title: EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications
- Authors: Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed
Waqas Zamir, Rao Muhammad Anwer, Fahad Shahbaz Khan
- Abstract summary: We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
- Score: 68.35683849098105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the pursuit of achieving ever-increasing accuracy, large and complex
neural networks are usually developed. Such models demand high computational
resources and therefore cannot be deployed on edge devices. It is of great
interest to build resource-efficient general purpose networks due to their
usefulness in several application areas. In this work, we strive to effectively
combine the strengths of both CNN and Transformer models and propose a new
efficient hybrid architecture EdgeNeXt. Specifically in EdgeNeXt, we introduce
a split depth-wise transpose attention (SDTA) encoder that splits input tensors
into multiple channel groups and utilizes depth-wise convolution along with
self-attention across channel dimensions to implicitly increase the receptive
field and encode multi-scale features. Our extensive experiments on
classification, detection and segmentation tasks reveal the merits of the
proposed approach, outperforming state-of-the-art methods with comparatively
lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves
71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute
gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with
5.6M parameters achieves 79.4\% top-1 accuracy on ImageNet-1K. The code and
models are publicly available at https://t.ly/_Vu9.
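As a rough aid to the SDTA description in the abstract (channel splitting, depth-wise convolutions, and self-attention applied across the channel dimension), the sketch below gives one plausible reading of that design in PyTorch. The group count, kernel size, head count, and all layer names are assumptions made for this illustration; it is not the released EdgeNeXt code.

```python
# Minimal sketch of the SDTA idea summarized above: split the channels into
# groups, mix each group spatially with a depth-wise conv that reuses the
# previous group's output (progressively enlarging the receptive field), then
# apply self-attention across the channel dimension instead of across pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDTASketch(nn.Module):
    def __init__(self, dim: int, groups: int = 4, num_heads: int = 4):
        super().__init__()
        assert dim % groups == 0 and dim % num_heads == 0
        self.groups, self.num_heads = groups, num_heads
        width = dim // groups
        # One 3x3 depth-wise conv per channel group (multi-scale spatial mixing).
        self.dwconvs = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1, groups=width) for _ in range(groups - 1)]
        )
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        B, C, H, W = x.shape
        # 1) Split channels into groups and mix them hierarchically.
        splits = torch.split(x, C // self.groups, dim=1)
        outs, prev = [splits[0]], splits[0]
        for conv, s in zip(self.dwconvs, splits[1:]):
            prev = conv(s + prev)          # reuse the previous group's output
            outs.append(prev)
        x = torch.cat(outs, dim=1)

        # 2) Transposed self-attention: the attention map is (C/h) x (C/h),
        #    so cost scales with channels rather than with the number of pixels.
        tokens = x.flatten(2).transpose(1, 2)                     # (B, HW, C)
        q, k, v = self.qkv(self.norm(tokens)).chunk(3, dim=-1)
        shape = (B, self.num_heads, C // self.num_heads, H * W)
        q, k, v = (t.transpose(1, 2).reshape(shape) for t in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature       # (B, h, C/h, C/h)
        y = (attn.softmax(dim=-1) @ v).reshape(B, C, H * W)
        y = self.proj(y.transpose(1, 2)).transpose(1, 2)          # mix channels
        return x + y.reshape(B, C, H, W)                          # residual
```

Because the attention is computed between channels, its quadratic term grows with channel count rather than with spatial resolution, which is what keeps this style of block affordable on edge devices.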
Related papers
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38x higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - Semantic Segmentation in Satellite Hyperspectral Imagery by Deep Learning [54.094272065609815]
We propose a lightweight 1D-CNN model, 1D-Justo-LiuNet, which outperforms state-of-the-art models in the hypespectral domain.
1D-Justo-LiuNet achieves the highest accuracy (0.93) with the smallest model size (4,563 parameters) among all tested models.
arXiv Detail & Related papers (2023-10-24T21:57:59Z) - Efficient Deep Spiking Multi-Layer Perceptrons with Multiplication-Free Inference [13.924924047051782]
Deep convolution architectures for Spiking Neural Networks (SNNs) have significantly enhanced image classification performance and reduced computational burdens.
This research explores a new pathway, drawing inspiration from the progress made in Multi-Layer Perceptrons (MLPs).
We propose an innovative spiking architecture that uses batch normalization to retain MFI compatibility.
We establish an efficient multi-stage spiking network that effectively blends global receptive fields with local feature extraction.
arXiv Detail & Related papers (2023-06-21T16:52:20Z) - InceptionNeXt: When Inception Meets ConvNeXt [167.61042926444105]
We build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance.
InceptionNeXt achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2% top-1 accuracy improvement on ImageNet-1K.
arXiv Detail & Related papers (2023-03-29T17:59:58Z) - ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
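The ConvNeXt V2 entry above names a Global Response Normalization (GRN) layer without describing it. The snippet below is a minimal sketch of GRN as commonly described (per-channel global L2 aggregation, divisive normalization across channels, then a learned recalibration with a residual), assuming a channels-last tensor layout; treat it as an illustration rather than the paper's exact layer.

```python
# Sketch of a Global Response Normalization (GRN) layer: aggregate each
# channel's global L2 norm, normalize it by the mean over channels, and use
# the result to recalibrate the features with learned scale/offset plus a
# residual. A channels-last layout (B, H, W, C) is assumed in this sketch.
import torch
import torch.nn as nn


class GRNSketch(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # per-channel global norm
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)    # divisive normalization
        return self.gamma * (x * nx) + self.beta + x            # calibrate + residual
```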
arXiv Detail & Related papers (2023-01-02T18:59:31Z) - InternImage: Exploring Large-Scale Vision Foundation Models with
Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs.
arXiv Detail & Related papers (2022-11-10T18:59:04Z) - Efficient CNN Architecture Design Guided by Visualization [13.074652653088584]
VGNetG-1.0MP achieves 67.7% top-1 accuracy with 0.99M parameters and 69.2% top-1 accuracy with 1.14M parameters on the ImageNet classification dataset.
Our VGNetF-1.5MP achieves 64.4% (-3.2%) top-1 accuracy and 66.2% (-1.4%) top-1 accuracy with additional Gaussian kernels.
arXiv Detail & Related papers (2022-07-21T06:22:15Z) - Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to reduce the computational cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z) - Greedy Network Enlarging [53.319011626986004]
We propose a greedy network enlarging method based on the reallocation of computations.
By modifying the computations of the different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs.
Applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies.
arXiv Detail & Related papers (2021-07-31T08:36:30Z)