A Battle of Network Structures: An Empirical Study of CNN, Transformer,
and MLP
- URL: http://arxiv.org/abs/2108.13002v1
- Date: Mon, 30 Aug 2021 06:09:02 GMT
- Title: A Battle of Network Structures: An Empirical Study of CNN, Transformer,
and MLP
- Authors: Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng,
Zheng-Jun Zha
- Abstract summary: Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
- Score: 121.35904748477421
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Convolutional neural networks (CNN) are the dominant deep neural network
(DNN) architecture for computer vision. Recently, Transformer and multi-layer
perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer,
started to lead new trends as they showed promising results in the ImageNet
classification task. In this paper, we conduct empirical studies on these DNN
structures and try to understand their respective pros and cons. To ensure a
fair comparison, we first develop a unified framework called SPACH which adopts
separate modules for spatial and channel processing. Our experiments under the
SPACH framework reveal that all structures can achieve competitive performance
at a moderate scale. However, they demonstrate distinctive behaviors when the
network size scales up. Based on our findings, we propose two hybrid models
using convolution and Transformer modules. The resulting Hybrid-MS-S+ model
achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs. It is
already on par with SOTA models of sophisticated design. The code and
models will be made publicly available.
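The abstract describes SPACH only at a high level. A minimal sketch of the spatial/channel separation it implies, written here under stated assumptions (module names, hidden sizes, and kernel sizes are illustrative choices, not the authors' released code), is a block whose spatial-mixing module can be swapped among convolution, self-attention, and an MLP while the channel-mixing MLP stays the same:

```python
# Minimal sketch of a SPACH-style block (assumed structure, NOT the authors' released code).
# The abstract's key point: spatial mixing and channel mixing are separate modules, so the
# spatial module can be a convolution, self-attention, or an MLP while the channel MLP is fixed.
import torch
import torch.nn as nn


def make_spatial_mixing(kind: str, dim: int, num_tokens: int) -> nn.Module:
    """Return one of the three spatial-mixing structures compared in the paper."""
    if kind == "conv":
        # depthwise convolution over the token axis (kernel size is an illustrative choice)
        return nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
    if kind == "attention":
        return nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    if kind == "mlp":
        # token-mixing MLP acting across the token axis
        return nn.Sequential(nn.Linear(num_tokens, num_tokens), nn.GELU(),
                             nn.Linear(num_tokens, num_tokens))
    raise ValueError(f"unknown spatial mixing: {kind}")


class SpachBlock(nn.Module):
    def __init__(self, dim: int, num_tokens: int, spatial: str = "attention"):
        super().__init__()
        self.kind = spatial
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = make_spatial_mixing(spatial, dim, num_tokens)
        self.norm2 = nn.LayerNorm(dim)
        # channel-mixing MLP shared by all three variants
        self.channel = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        h = self.norm1(x)
        if self.kind == "attention":
            h, _ = self.spatial(h, h, h)
        else:  # "conv" and "mlp" both act along the token axis, which the transpose puts last
            h = self.spatial(h.transpose(1, 2)).transpose(1, 2)
        x = x + h
        return x + self.channel(self.norm2(x))
```

Stacking this block with spatial set to "conv", "attention", or "mlp" gives the three single-structure models the abstract compares; mixing convolution and attention blocks within one stack corresponds to the hybrid models it proposes.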
Related papers
- HyperKAN: Kolmogorov-Arnold Networks make Hyperspectral Image Classificators Smarter [0.0699049312989311]
We propose the replacement of linear and convolutional layers of traditional networks with KAN-based counterparts.
These modifications allowed us to significantly increase the per-pixel classification accuracy for hyperspectral remote-sensing images.
The greatest effect was achieved for convolutional networks working exclusively on spectral data.
arXiv Detail & Related papers (2024-07-07T06:36:09Z)
- How to Learn More? Exploring Kolmogorov-Arnold Networks for Hyperspectral Image Classification [26.37105279142761]
Kolmogorov-Arnold Networks (KANs) were proposed as viable alternatives to vision transformers (ViTs).
In this study, we assess the effectiveness of KANs for complex hyperspectral image (HSI) data classification.
To enhance the HSI classification accuracy obtained by the KANs, we develop and propose a Hybrid architecture utilizing 1D, 2D, and 3D KANs.
arXiv Detail & Related papers (2024-06-22T03:31:02Z)
- SENetV2: Aggregated dense layer for channelwise and global representations [0.0]
We introduce a novel aggregated multilayer perceptron, a multi-branch dense layer, within the Squeeze residual module.
This fusion enhances the network's ability to capture channel-wise patterns and acquire global knowledge.
We conduct extensive experiments on benchmark datasets to validate the model and compare it with established architectures. (A rough sketch of the multi-branch module follows this entry.)
arXiv Detail & Related papers (2023-11-17T14:10:57Z)
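As a rough, hedged illustration of the multi-branch dense layer described in the SENetV2 entry above (branch count, reduction ratio, and all names are assumptions, not the authors' code):

```python
# Hedged sketch of a multi-branch squeeze-excitation module in the spirit of the entry above;
# branch count, reduction ratio, and all names are assumptions, not the authors' code.
import torch
import torch.nn as nn


class MultiBranchSqueezeExcite(nn.Module):
    def __init__(self, channels: int, num_branches: int = 4, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # several small dense branches process the squeezed descriptor in parallel
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
            for _ in range(num_branches)
        ])
        # branch outputs are aggregated, then expanded back to per-channel gates
        self.expand = nn.Linear(hidden * num_branches, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, H, W)
        squeezed = x.mean(dim=(2, 3))                     # global average pooling
        aggregated = torch.cat([branch(squeezed) for branch in self.branches], dim=1)
        gates = torch.sigmoid(self.expand(aggregated))
        return x * gates[:, :, None, None]                # channel-wise re-weighting
```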
- SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer [12.717450255837178]
Spiking Neural Networks (SNNs) have the advantages of low power consumption and high energy efficiency.
The most advanced SNN, Spikformer, combines the self-attention module from Transformer with SNN to achieve remarkable performance.
We present SparseSpikformer, a co-design framework aimed at achieving sparsity in Spikformer through token and weight pruning techniques.
arXiv Detail & Related papers (2023-11-15T09:22:52Z)
- NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction [37.357949900603295]
We propose a neural architecture representation model that can be used to estimate attributes holistically.
Experimental results show that our proposed framework can be used to predict the latency and accuracy attributes of both cell architectures and whole deep neural networks.
arXiv Detail & Related papers (2022-11-15T10:15:21Z)
- Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient variant, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. (A minimal sketch of a Mixer block follows this entry.)
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
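An illustrative sketch of the Mixer block structure the entry above refers to; hidden sizes are placeholders, not the configurations reported in the MLP-Mixer paper:

```python
# Illustrative sketch of a Mixer block: a token-mixing MLP followed by a channel-mixing MLP.
# Hidden sizes are placeholders, not the configurations reported in the MLP-Mixer paper.
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, num_tokens: int, dim: int,
                 token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        # token mixing acts across the patch/token axis
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # channel mixing acts across the feature axis
        return x + self.channel_mlp(self.norm2(x))
```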
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and Convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other. (A simplified sketch of the convolution module follows this entry.)
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
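A simplified sketch of the kind of convolution module that "convolution-augmented" refers to in the Conformer entry above; kernel size, normalization, and activation choices here are assumptions made for illustration:

```python
# Simplified sketch of a Conformer-style convolution module; kernel size, normalization,
# and activation choices are assumptions made for illustration.
import torch
import torch.nn as nn


class ConformerConvModule(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)  # doubled channels feed the GLU
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.activation = nn.SiLU()                                  # Swish
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        h = self.norm(x).transpose(1, 2)                  # -> (batch, dim, time)
        h = self.glu(self.pointwise_in(h))
        h = self.activation(self.batch_norm(self.depthwise(h)))
        h = self.pointwise_out(h).transpose(1, 2)
        return x + h                                      # residual connection
```

In the full Conformer block, a module of this kind sits between the self-attention sub-layer and the second feed-forward sub-layer.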
- Binarizing MobileNet via Evolution-based Searching [66.94247681870125]
We propose the use of evolutionary search to facilitate the construction and training scheme when binarizing MobileNet.
Inspired by one-shot architecture search frameworks, we manipulate the idea of group convolution to design efficient 1-Bit Convolutional Neural Networks (CNNs).
Our objective is to come up with a tiny yet efficient binary neural architecture by exploring the best candidates of the group convolution.
arXiv Detail & Related papers (2020-05-13T13:25:51Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
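The entry above hinges on aligning neurons across models before averaging. A much-simplified sketch, with a hard assignment on weight similarity standing in for the paper's optimal-transport coupling (function names and the toy data are assumptions), is:

```python
# Much-simplified sketch of layer-wise neuron alignment before averaging two models.
# A hard assignment on weight similarity stands in for the paper's optimal-transport
# coupling; function names and the toy data below are assumptions for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment


def fuse_layer(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """Align output neurons of w_b to those of w_a, then average the weight matrices."""
    # cost of matching neuron i of model A with neuron j of model B
    cost = np.linalg.norm(w_a[:, None, :] - w_b[None, :, :], axis=-1)
    row_idx, col_idx = linear_sum_assignment(cost)  # one-to-one alignment
    return 0.5 * (w_a[row_idx] + w_b[col_idx])


# toy usage: model B's layer is a neuron-permuted, slightly noisy copy of model A's layer
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(8, 16))                   # (out_features, in_features)
layer_b = layer_a[rng.permutation(8)] + 0.01 * rng.normal(size=(8, 16))
fused = fuse_layer(layer_a, layer_b)                 # approximately recovers layer_a
```

A real layer-wise fusion would also propagate each alignment to the next layer's input weights and can use soft transport plans rather than a hard permutation; the sketch only shows the per-layer idea.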
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.