S$^2$-MLP: Spatial-Shift MLP Architecture for Vision
- URL: http://arxiv.org/abs/2106.07477v1
- Date: Mon, 14 Jun 2021 15:05:11 GMT
- Title: S$^2$-MLP: Spatial-Shift MLP Architecture for Vision
- Authors: Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
- Abstract summary: Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation.
In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP).
- Score: 34.47616917228978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, visual Transformer (ViT) and its following works abandon the
convolution and exploit the self-attention operation, attaining a comparable or
even higher accuracy than CNN. More recently, MLP-Mixer abandons both the
convolution and the self-attention operation, proposing an architecture
containing only MLP layers. To achieve cross-patch communications, it devises
an additional token-mixing MLP besides the channel-mixing MLP. It achieves
promising results when trained on an extremely large-scale dataset. But it
cannot match the outstanding performance of its CNN and ViT counterparts when
trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. The
performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We
discover that the token-mixing operation in MLP-Mixer is a variant of depthwise
convolution with a global receptive field and a spatial-specific configuration.
But the global receptive field and the spatial-specific property make the
token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure
MLP architecture, spatial-shift MLP (S$^2$-MLP). Different from MLP-Mixer, our
S$^2$-MLP only contains channel-mixing MLP. We devise a spatial-shift operation
for achieving communication between patches. It has a local receptive field
and is spatial-agnostic. Meanwhile, it is parameter-free and computationally
efficient. The proposed S$^2$-MLP attains higher recognition accuracy than
MLP-Mixer when trained on the ImageNet-1K dataset. Meanwhile, S$^2$-MLP
matches the excellent performance of ViT on the ImageNet-1K dataset with a
considerably simpler architecture and fewer FLOPs and parameters.
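The spatial-shift operation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes patch features laid out as a (batch, height, width, channels) tensor, splits the channels into four groups, and shifts each group by one patch along a different spatial direction, leaving border patches unchanged. The specific group-to-direction assignment here is an illustrative assumption.

```python
import numpy as np

def spatial_shift(x):
    """Parameter-free spatial shift over patch features.

    x: array of shape (B, H, W, C). Channels are split into 4 groups,
    each shifted by one patch in one of the 4 spatial directions.
    Border patches keep their original values. The group ordering is
    an illustrative choice, not necessarily the paper's exact one.
    """
    b, h, w, c = x.shape
    g = c // 4
    out = x.copy()
    # group 0: shift right along the width axis
    out[:, :, 1:, :g] = x[:, :, :-1, :g]
    # group 1: shift left along the width axis
    out[:, :, :-1, g:2 * g] = x[:, :, 1:, g:2 * g]
    # group 2: shift down along the height axis
    out[:, 1:, :, 2 * g:3 * g] = x[:, :-1, :, 2 * g:3 * g]
    # group 3: shift up along the height axis
    out[:, :-1, :, 3 * g:] = x[:, 1:, :, 3 * g:]
    return out
```

Because the shift itself adds no parameters, all learnable weights stay in the channel-mixing MLPs that precede and follow it; cross-patch communication emerges from repeated shifted mixing across layers.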