Are we ready for a new paradigm shift? A Survey on Visual Deep MLP
- URL: http://arxiv.org/abs/2111.04060v1
- Date: Sun, 7 Nov 2021 12:02:00 GMT
- Title: Are we ready for a new paradigm shift? A Survey on Visual Deep MLP
- Authors: Ruiyang Liu, Yinghui Li, Dun Liang, Linmi Tao, Shimin Hu, Hai-Tao
Zheng
- Abstract summary: Multilayer perceptron (MLP), as the first neural network structure to appear, was a big hit.
Constrained by the hardware computing power and the size of the datasets of its day, it sank into obscurity for decades.
We have witnessed a paradigm shift from manual feature extraction to the CNN with local receptive fields, and further to the Transformer with global receptive fields.
- Score: 33.00328314841369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilayer perceptron (MLP), as the first neural network structure to appear,
was a big hit. But constrained by the hardware computing power and the size of
the datasets of its day, it sank into obscurity for decades. During this period, we have
witnessed a paradigm shift from manual feature extraction to the CNN with local
receptive fields, and further to the Transformer with global receptive fields
based on the self-attention mechanism. This year (2021), with the introduction
of MLP-Mixer, the MLP has re-entered the limelight and has attracted extensive
research from the computer vision community. Compared to the conventional MLP,
it is deeper but changes the input from full flattening to patch flattening.
Given its high performance and reduced need for vision-specific inductive bias,
the community cannot help but wonder: will the MLP, the simplest structure with
global receptive fields but no attention, become a new computer vision
paradigm? To answer this question, this survey aims to provide a comprehensive
overview of the recent development of vision deep MLP models. Specifically, we
review these vision deep MLPs in detail, from the subtle sub-module design to
the global network structure. We compare the receptive field, computational
complexity, and other properties of different network designs in order to gain
a clear understanding of the development path of MLPs. The investigation shows
that MLPs' resolution sensitivity and computational density remain
unresolved, and that pure MLPs are gradually evolving to become CNN-like. We
suggest that the current data volume and computational power are not yet ready
to embrace pure MLPs, and that artificial visual guidance remains important.
Finally, we provide an analysis of open research directions and possible future
works. We hope this effort will ignite further interest in the community and
encourage better vision-tailored neural network designs.
Related papers
- X-MLP: A Patch Embedding-Free MLP Architecture for Vision [4.493200639605705]
Multi-layer perceptron (MLP) architectures for vision have recently become popular again.
We propose X-MLP, an architecture constructed entirely from fully connected layers and free from patch embedding.
X-MLP is tested on ten benchmark datasets, outperforming other vision models on all of them.
arXiv Detail & Related papers (2023-07-02T15:20:25Z) - GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation [68.65764751482774]
GraphMLP is a global-local-graphical unified architecture for 3D human pose estimation.
It incorporates the graph structure of human bodies into the model to meet the domain-specific demands of 3D human pose estimation.
It can be extended to model complex temporal dynamics in a simple way, with a negligible increase in computational cost as the sequence length grows.
arXiv Detail & Related papers (2022-06-13T18:59:31Z) - MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing [123.43419144051703]
We present a novel MLP-like 3D architecture for video recognition.
The results are comparable to state-of-the-art widely-used 3D CNNs and video transformers.
arXiv Detail & Related papers (2022-06-13T16:21:33Z) - MDMLP: Image Classification from Scratch on Small Datasets with MLP [7.672827879118106]
Recently, the attention mechanism has become a go-to technique for natural language processing and computer vision tasks.
Meanwhile, MLP-Mixer and other architectures based simply on multi-layer perceptrons (MLPs) have proven similarly powerful compared to CNNs and attention techniques.
arXiv Detail & Related papers (2022-05-28T16:26:59Z) - ActiveMLP: An MLP-like Architecture with Active Token Mixer [54.95923719553343]
This paper presents ActiveMLP, a general MLP-like backbone for computer vision.
We propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given one.
In this way, the spatial range of token-mixing is expanded and the way of token-mixing is reformed.
arXiv Detail & Related papers (2022-03-11T17:29:54Z) - Convolutional Gated MLP: Combining Convolutions & gMLP [0.0]
This paper introduces convolutions into the gated multilayer perceptron (gMLP).
Google Brain introduced the gMLP in May 2021; Microsoft introduced convolutions into the Vision Transformer (CvT) in March 2021.
Inspired by both gMLP and CvT, we introduce convolutional layers into gMLP (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-11-06T19:11:24Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns (see the sketch after this list).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - Hire-MLP: Vision MLP via Hierarchical Rearrangement [58.33383667626998]
Hire-MLP is a simple yet competitive vision MLP architecture built via hierarchical rearrangement.
The proposed Hire-MLP architecture is built with simple channel-mixing operations, thus enjoys high flexibility and inference speed.
Experiments show that our Hire-MLP achieves state-of-the-art performance on the ImageNet-1K benchmark.
arXiv Detail & Related papers (2021-08-30T16:11:04Z) - RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision? [0.0]
CNNs have reigned supreme in the world of computer vision for the past ten years, but recently the Transformer has been on the rise.
In particular, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias.
The proposed model, named RaftMLP, has a good balance of computational complexity, the number of parameters, and actual memory usage.
arXiv Detail & Related papers (2021-08-09T23:55:24Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
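For the Convolutional Gated MLP entry above, the sketch below shows one way convolutions could enter gMLP's spatial gating unit: split the channels into a content branch and a gating branch, mix the gating branch both locally (a depthwise convolution, the CNN-like bias) and globally (a dense projection over tokens, as in gMLP), then gate. This is an illustrative reading, not the paper's exact design; class and parameter names are hypothetical, and the published gMLP block additionally wraps this unit in channel projections.

```python
import torch
import torch.nn as nn

class ConvSpatialGatingUnit(nn.Module):
    """gMLP-style spatial gating with a convolution on the gating branch.
    Sketch only: gMLP's SGU splits channels into (u, v), spatially mixes v,
    and outputs u * v; here a depthwise 1D conv is added to the v branch as
    one plausible reading of 'introducing convolutions into gMLP'."""
    def __init__(self, dim, num_tokens, kernel_size=3):
        super().__init__()
        half = dim // 2
        self.norm = nn.LayerNorm(half)
        # Local mixing over the token axis (depthwise conv, CNN-like bias).
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        # Global mixing over the token axis (dense projection, as in gMLP).
        self.spatial = nn.Linear(num_tokens, num_tokens)

    def forward(self, x):                 # x: (batch, tokens, dim)
        u, v = x.chunk(2, dim=-1)         # each: (batch, tokens, dim // 2)
        v = self.norm(v).transpose(1, 2)  # (batch, dim // 2, tokens)
        v = self.spatial(self.conv(v)).transpose(1, 2)
        return u * v                      # gated output, half the channels

tokens = torch.randn(2, 196, 512)
print(ConvSpatialGatingUnit(512, 196)(tokens).shape)  # torch.Size([2, 196, 256])
```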
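For the Sparse MLP entry, here is a minimal sketch of axial 1D token mixing with weight sharing: one linear map mixes along rows and another along columns, each reused across the other axis, so the mixing parameters scale with H^2 + W^2 instead of (H*W)^2 for full token mixing. Module names and the three-branch fusion are assumptions; the published sMLPNet combines these paths inside a larger block.

```python
import torch
import torch.nn as nn

class AxialSparseMLP(nn.Module):
    """Axial token mixing in the spirit of sMLP: one 1D MLP along rows and
    one along columns, with weights shared across the other axis. Sketch
    only; names and the fusion layer are illustrative."""
    def __init__(self, height, width, dim):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # shared across all rows
        self.mix_h = nn.Linear(height, height)  # shared across all columns
        self.fuse = nn.Linear(3 * dim, dim)     # merge identity + two axes

    def forward(self, x):                       # x: (batch, H, W, dim)
        # Mix along the width axis, weights reused for every row and channel.
        row = self.mix_w(x.transpose(2, 3)).transpose(2, 3)
        # Mix along the height axis, weights reused for every column/channel.
        col = self.mix_h(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.fuse(torch.cat([x, row, col], dim=-1))

x = torch.randn(2, 14, 14, 96)                  # (batch, H, W, channels)
print(AxialSparseMLP(14, 14, 96)(x).shape)      # torch.Size([2, 14, 14, 96])
```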