EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications
- URL: http://arxiv.org/abs/2206.10589v1
- Date: Tue, 21 Jun 2022 17:59:56 GMT
- Title: EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications
- Authors: Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed
Waqas Zamir, Rao Muhammad Anwer, Fahad Shahbaz Khan
- Abstract summary: We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
- Score: 68.35683849098105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the pursuit of achieving ever-increasing accuracy, large and complex
neural networks are usually developed. Such models demand high computational
resources and therefore cannot be deployed on edge devices. It is of great
interest to build resource-efficient general purpose networks due to their
usefulness in several application areas. In this work, we strive to effectively
combine the strengths of both CNN and Transformer models and propose a new
efficient hybrid architecture EdgeNeXt. Specifically in EdgeNeXt, we introduce
a split depth-wise transpose attention (SDTA) encoder that splits input tensors
into multiple channel groups and utilizes depth-wise convolution along with
self-attention across channel dimensions to implicitly increase the receptive
field and encode multi-scale features. Our extensive experiments on
classification, detection and segmentation tasks reveal the merits of the
proposed approach, outperforming state-of-the-art methods with comparatively
lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves
71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute
gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with
5.6M parameters achieves 79.4\% top-1 accuracy on ImageNet-1K. The code and
models are publicly available at https://t.ly/_Vu9.
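As a rough aid to the SDTA description in the abstract (channel splitting, depth-wise convolutions, and self-attention applied across the channel dimension), the sketch below gives one plausible reading of that design in PyTorch. The group count, kernel size, head count, and all layer names are assumptions made for this illustration; it is not the released EdgeNeXt code.

```python
# Minimal sketch of the SDTA idea summarized above: split the channels into
# groups, mix each group spatially with a depth-wise conv that reuses the
# previous group's output (progressively enlarging the receptive field), then
# apply self-attention across the channel dimension instead of across pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDTASketch(nn.Module):
    def __init__(self, dim: int, groups: int = 4, num_heads: int = 4):
        super().__init__()
        assert dim % groups == 0 and dim % num_heads == 0
        self.groups, self.num_heads = groups, num_heads
        width = dim // groups
        # One 3x3 depth-wise conv per channel group (multi-scale spatial mixing).
        self.dwconvs = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1, groups=width) for _ in range(groups - 1)]
        )
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        B, C, H, W = x.shape
        # 1) Split channels into groups and mix them hierarchically.
        splits = torch.split(x, C // self.groups, dim=1)
        outs, prev = [splits[0]], splits[0]
        for conv, s in zip(self.dwconvs, splits[1:]):
            prev = conv(s + prev)          # reuse the previous group's output
            outs.append(prev)
        x = torch.cat(outs, dim=1)

        # 2) Transposed self-attention: the attention map is (C/h) x (C/h),
        #    so cost scales with channels rather than with the number of pixels.
        tokens = x.flatten(2).transpose(1, 2)                     # (B, HW, C)
        q, k, v = self.qkv(self.norm(tokens)).chunk(3, dim=-1)
        shape = (B, self.num_heads, C // self.num_heads, H * W)
        q, k, v = (t.transpose(1, 2).reshape(shape) for t in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature       # (B, h, C/h, C/h)
        y = (attn.softmax(dim=-1) @ v).reshape(B, C, H * W)
        y = self.proj(y.transpose(1, 2)).transpose(1, 2)          # mix channels
        return x + y.reshape(B, C, H, W)                          # residual
```

Because the attention is computed between channels, its quadratic term grows with channel count rather than with spatial resolution, which is what keeps this style of block affordable on edge devices.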
Related papers
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38x higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - Semantic Segmentation in Satellite Hyperspectral Imagery by Deep Learning [54.094272065609815]
We propose a lightweight 1D-CNN model, 1D-Justo-LiuNet, which outperforms state-of-the-art models in the hypespectral domain.
1D-Justo-LiuNet achieves the highest accuracy (0.93) with the smallest model size (4,563 parameters) among all tested models.
arXiv Detail & Related papers (2023-10-24T21:57:59Z) - Efficient Deep Spiking Multi-Layer Perceptrons with Multiplication-Free Inference [13.924924047051782]
Deep convolution architectures for Spiking Neural Networks (SNNs) have significantly enhanced image classification performance and reduced computational burdens.
This research explores a new pathway, drawing inspiration from the progress made in Multi-Layer Perceptrons (MLPs).
We propose an innovative spiking architecture that uses batch normalization to retain MFI compatibility.
We establish an efficient multi-stage spiking network that effectively blends global receptive fields with local feature extraction.
arXiv Detail & Related papers (2023-06-21T16:52:20Z) - InceptionNeXt: When Inception Meets ConvNeXt [167.61042926444105]
We build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance.
InceptionNeXt achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2% top-1 accuracy improvement on ImageNet-1K.
arXiv Detail & Related papers (2023-03-29T17:59:58Z) - ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
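The ConvNeXt V2 entry above names a Global Response Normalization (GRN) layer without describing it. The snippet below is a minimal sketch of GRN as commonly described (per-channel global L2 aggregation, divisive normalization across channels, then a learned recalibration with a residual), assuming a channels-last tensor layout; treat it as an illustration rather than the paper's exact layer.

```python
# Sketch of a Global Response Normalization (GRN) layer: aggregate each
# channel's global L2 norm, normalize it by the mean over channels, and use
# the result to recalibrate the features with learned scale/offset plus a
# residual. A channels-last layout (B, H, W, C) is assumed in this sketch.
import torch
import torch.nn as nn


class GRNSketch(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # per-channel global norm
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)    # divisive normalization
        return self.gamma * (x * nx) + self.beta + x            # calibrate + residual
```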
arXiv Detail & Related papers (2023-01-02T18:59:31Z) - InternImage: Exploring Large-Scale Vision Foundation Models with
Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs.
arXiv Detail & Related papers (2022-11-10T18:59:04Z) - Efficient CNN Architecture Design Guided by Visualization [13.074652653088584]
VGNetG-1.0MP achieves 67.7% top-1 accuracy with 0.99M parameters and 69.2% top-1 accuracy with 1.14M parameters on the ImageNet classification dataset.
Our VGNetF-1.5MP achieves 64.4% (-3.2%) top-1 accuracy and 66.2% (-1.4%) top-1 accuracy with additional Gaussian kernels.
arXiv Detail & Related papers (2022-07-21T06:22:15Z) - Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to reduce the computational cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z) - Greedy Network Enlarging [53.319011626986004]
We propose a greedy network enlarging method based on the reallocation of computations.
By modifying the computations of the different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs.
Applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies.
arXiv Detail & Related papers (2021-07-31T08:36:30Z)