S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
- URL: http://arxiv.org/abs/2108.01072v1
- Date: Mon, 2 Aug 2021 17:59:02 GMT
- Title: S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
- Authors: Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
- Abstract summary: MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition.
In this paper, we improve the S$^2$-MLP vision backbone.
Our medium-scale model, S$^2$-MLPv2-Medium, achieves $83.6\%$ top-1 accuracy on the ImageNet-1K benchmark.
- Score: 34.47616917228978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, MLP-based vision backbones have emerged. MLP-based vision architectures
with less inductive bias achieve competitive performance in image recognition
compared with CNNs and vision Transformers. Among them, spatial-shift MLP
(S$^2$-MLP), adopting the straightforward spatial-shift operation, achieves
better performance than the pioneering works including MLP-mixer and ResMLP.
More recently, using smaller patches with a pyramid structure, Vision
Permutator (ViP) and Global Filter Network (GFNet) achieve better performance
than S$^2$-MLP.
In this paper, we improve the S$^2$-MLP vision backbone. We expand the
feature map along the channel dimension and split the expanded feature map into
several parts. We conduct different spatial-shift operations on split parts.
Meanwhile, we exploit the split-attention operation to fuse these split
parts. Moreover, like its counterparts, we adopt smaller-scale patches and use
a pyramid structure to boost the image recognition accuracy. We term the
improved spatial-shift MLP vision backbone S$^2$-MLPv2. Using 55M
parameters, our medium-scale model, S$^2$-MLPv2-Medium, achieves $83.6\%$
top-1 accuracy on the ImageNet-1K benchmark using $224\times 224$ images,
without self-attention or external training data.
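To make the description above concrete, here is a minimal PyTorch sketch of one such block: expand the channels threefold, split the result into three parts, spatially shift two of them with different direction orders, and fuse the parts with split-attention. All names, layer sizes, and the use of circular `torch.roll` (the paper's shift pads at the border rather than wrapping around) are our illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_shift(x, order=((1, 3), (-1, 3), (1, 2), (-1, 2))):
    """Shift four channel groups of x (B, C, H, W) by one pixel each,
    e.g. right/left along W, then down/up along H. torch.roll wraps
    around; the paper's shift is non-circular (edge-padded)."""
    g = x.shape[1] // 4          # assumes C divisible by 4
    out = x.clone()
    for i, (shift, dim) in enumerate(order):
        out[:, i * g:(i + 1) * g] = torch.roll(
            x[:, i * g:(i + 1) * g], shifts=shift, dims=dim)
    return out

class SplitAttention(nn.Module):
    """Fuse K parallel branches with a softmax over branches,
    in the spirit of ResNeSt's split-attention."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.fc1 = nn.Linear(channels, channels, bias=False)
        self.fc2 = nn.Linear(channels, channels * k)

    def forward(self, xs):                         # xs: (B, K, N, C)
        b, k, n, c = xs.shape
        a = xs.sum(dim=1).mean(dim=1)              # global context (B, C)
        a = self.fc2(F.gelu(self.fc1(a)))          # (B, K*C)
        attn = a.view(b, k, c).softmax(dim=1)      # weights over branches
        return (xs * attn.unsqueeze(2)).sum(dim=1) # (B, N, C)

class S2MLPv2Block(nn.Module):
    """Channel expansion -> three splits -> two differently ordered
    spatial shifts plus an identity branch -> split-attention fusion."""
    def __init__(self, channels):
        super().__init__()
        self.expand = nn.Linear(channels, channels * 3)
        self.fuse = SplitAttention(channels, k=3)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                          # x: (B, H, W, C)
        b, h, w, c = x.shape
        x = self.expand(x).view(b, h, w, 3, c).permute(0, 3, 4, 1, 2)
        x1 = spatial_shift(x[:, 0])                # first shift order
        x2 = spatial_shift(x[:, 1],                # second shift order
                           order=((1, 2), (-1, 2), (1, 3), (-1, 3)))
        xs = torch.stack([x1, x2, x[:, 2]], dim=1) # (B, 3, C, H, W)
        xs = xs.flatten(3).transpose(2, 3)         # (B, 3, H*W, C)
        return self.proj(self.fuse(xs)).view(b, h, w, c)
```

This runs on, e.g., `x = torch.randn(2, 14, 14, 64)`; in the full model such token-mixing blocks would presumably sit inside residual layers with LayerNorm and channel MLPs, with a pyramid of stages operating on progressively coarser feature maps.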
Related papers
- R2-MLP: Round-Roll MLP for Multi-View 3D Object Recognition [33.53114929452528]
Vision architectures based exclusively on multi-layer perceptrons (MLPs) have gained much attention in the computer vision community.
We present R$^2$-MLP, which performs view-based 3D object recognition by considering the communications between patches from different views.
With a conceptually simple structure, our R$^2$-MLP achieves competitive performance compared with existing methods.
arXiv Detail & Related papers (2022-11-20T21:13:02Z) - Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Mixing and Shifting: Exploiting Global and Local Dependencies in Vision
MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP (MS-MLP), which increases the size of the local receptive field used for mixing with the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
arXiv Detail & Related papers (2022-02-14T06:53:48Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns (see the sketch after this list).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - CycleMLP: A MLP-like Architecture for Dense Prediction [26.74203747156439]
CycleMLP is a versatile backbone for visual recognition and dense predictions.
It can cope with various image sizes and achieves computational complexity linear in the image size by using local windows.
CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models.
arXiv Detail & Related papers (2021-07-21T17:23:06Z) - Vision Permutator: A Permutable MLP-Like Architecture for Visual
Recognition [185.80889967154963]
We present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition.
By realizing the importance of the positional information carried by 2D feature representations, Vision Permutator encodes the feature representations along the height and width dimensions with linear projections.
We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers.
arXiv Detail & Related papers (2021-06-23T13:05:23Z) - S$^2$-MLP: Spatial-Shift MLP Architecture for Vision [34.47616917228978]
Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation.
In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP).
arXiv Detail & Related papers (2021-06-14T15:05:11Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
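For comparison with the shift-based mixing sketched earlier, the axial token mixing described for sMLPNet above can be sketched as follows. The three-branch concatenation and the fusion layer are our assumptions about the design for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    """Axial 1D token mixing: one linear layer mixes tokens along W
    (weights shared across rows) and another along H (shared across
    columns); the branches are concatenated and fused channel-wise."""
    def __init__(self, h, w, channels):
        super().__init__()
        self.mix_w = nn.Linear(w, w)   # mixes along width
        self.mix_h = nn.Linear(h, h)   # mixes along height
        self.proj = nn.Linear(channels * 3, channels)

    def forward(self, x):              # x: (B, H, W, C)
        xw = self.mix_w(x.transpose(2, 3)).transpose(2, 3)           # mix W
        xh = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)   # mix H
        return self.proj(torch.cat([x, xw, xh], dim=-1))
```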