Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
- URL: http://arxiv.org/abs/2202.06510v1
- Date: Mon, 14 Feb 2022 06:53:48 GMT
- Title: Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
- Authors: Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou
- Abstract summary: Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP (MS-MLP), which makes the size of the local receptive field used for mixing increase with the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
- Score: 84.3235981545673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Token-mixing multi-layer perceptron (MLP) models have shown competitive
performance in computer vision tasks with a simple architecture and relatively
small computational cost. Their success in maintaining computation efficiency
is mainly attributed to avoiding the use of self-attention that is often
computationally heavy, yet this is at the expense of not being able to mix
tokens both globally and locally. In this paper, to exploit both global and
local dependencies without self-attention, we present Mix-Shift-MLP (MS-MLP)
which makes the size of the local receptive field used for mixing increase with
respect to the amount of spatial shifting. In addition to conventional mixing
and shifting techniques, MS-MLP mixes both neighboring and distant tokens from
fine- to coarse-grained levels and then gathers them via a shifting operation.
This directly contributes to the interactions between global and local tokens.
Being simple to implement, MS-MLP achieves competitive performance in multiple
vision benchmarks. For example, an MS-MLP with 85 million parameters achieves
83.8% top-1 classification accuracy on ImageNet-1K. Moreover, by combining
MS-MLP with state-of-the-art Vision Transformers such as the Swin Transformer,
we show MS-MLP achieves further improvements on three different model scales,
e.g., by 0.5% on ImageNet-1K classification with Swin-B. The code is available
at: https://github.com/JegZheng/MS-MLP.
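To make the shift-then-mix idea concrete, here is a minimal NumPy sketch: channel groups are shifted spatially by increasing offsets, and an ordinary channel-mixing layer then lets each token interact with tokens up to the largest offset away. The height-only shifting, fixed offsets, and function names are illustrative assumptions; the actual MS-MLP block further ties the size of the mixing receptive field to the shift distance (see the paper and repository for details).
```python
import numpy as np

def spatial_shift_groups(x, shifts):
    """Split the channels into len(shifts) groups and shift each group
    along the height axis by its offset, zero-padding at the border.

    x: (H, W, C) feature map; shifts: per-group row offsets, e.g. [0, 1, 2, 3].
    """
    groups = np.split(x, len(shifts), axis=-1)  # C must divide evenly
    out = []
    for g, s in zip(groups, shifts):
        shifted = np.zeros_like(g)
        if s == 0:
            shifted = g
        elif s > 0:
            shifted[s:] = g[:-s]   # move content down by s rows
        else:
            shifted[:s] = g[-s:]   # move content up by |s| rows
        out.append(shifted)
    return np.concatenate(out, axis=-1)

def channel_mix(x, w_mix):
    """Channel-mixing layer: after shifting, each token's channels carry
    features from rows up to max(shifts) away, so a plain per-token
    linear map mixes nearby and distant tokens at once."""
    return x @ w_mix

# toy usage: an 8x8 map with 8 channels, 4 groups shifted by 0..3 rows
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))
y = channel_mix(spatial_shift_groups(x, [0, 1, 2, 3]),
                rng.standard_normal((8, 8)))
print(y.shape)  # (8, 8, 8)
```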
Related papers
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
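As a rough illustration of parameterizing cross-token relations by relative position, the sketch below builds a Toeplitz-structured token-mixing matrix whose entries depend only on the offset between tokens, reducing the parameter count from N^2 to 2N-1. This is a simplified stand-in for the PoSGU, whose actual gating formulation differs; all names here are illustrative.
```python
import numpy as np

def relative_mixing_matrix(n, rel_weights):
    """Build an (n, n) token-mixing matrix whose entry (i, j) depends only
    on the relative offset j - i, so cross-token weights are shared by
    relative position (a Toeplitz structure).

    rel_weights: 2n-1 values indexed by offset -(n-1)..(n-1)."""
    idx = np.arange(n)
    offsets = idx[None, :] - idx[:, None] + (n - 1)  # map offsets to 0..2n-2
    return rel_weights[offsets]

# a sequence of 6 tokens mixed with relative-position weights
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))               # (tokens, channels)
w = relative_mixing_matrix(6, rng.standard_normal(11))
print((w @ tokens).shape)  # (6, 4)
```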
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
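A minimal sketch of the sparse-MoE ingredient, assuming simple top-1 routing: each token's gate selects one expert, so only a fraction of the parameters is active per token. The routing and expert form here are illustrative; the paper's design (experts over both token and feature dimensions, plus balancing mechanisms) is richer.
```python
import numpy as np

def moe_ffn(x, experts, w_gate):
    """Top-1 mixture-of-experts feed-forward: each token is routed to the
    single expert with the highest gate score, so only that expert's
    weights are used for the token (sparse activation).

    x: (tokens, d); experts: list of (d, d) weight matrices; w_gate: (d, E).
    """
    scores = x @ w_gate                           # (tokens, E) routing logits
    choice = scores.argmax(axis=-1)               # top-1 expert per token
    out = np.empty_like(x)
    for e, w in enumerate(experts):
        mask = choice == e
        out[mask] = np.maximum(x[mask] @ w, 0.0)  # ReLU expert MLP
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))
experts = [rng.standard_normal((8, 8)) for _ in range(4)]
print(moe_ffn(x, experts, rng.standard_normal((8, 4))).shape)  # (10, 8)
```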
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
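The axial idea can be sketched in a few lines: one 1D mixing matrix acts along the height axis and another along the width axis, with the same weights reused across all columns, rows, and channels. The additive fusion of branches below is a simplification of sMLPNet's actual design; names are illustrative.
```python
import numpy as np

def axial_mix(x, w_h, w_w):
    """Axial token mixing: a 1D MLP along height and one along width,
    each shared across the other axis and across channels.

    x: (H, W, C); w_h: (H, H); w_w: (W, W).
    """
    mixed_h = np.einsum('hwc,hk->kwc', x, w_h)  # tokens in each column interact
    mixed_w = np.einsum('hwc,wk->hkc', x, w_w)  # tokens in each row interact
    return mixed_h + mixed_w + x                # simple additive fusion + identity

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
out = axial_mix(x, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
print(out.shape)  # (8, 8, 4)
```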
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
- ConvMLP: Hierarchical Convolutional MLPs for Vision [7.874749885641495]
We propose ConvMLP: a hierarchical, light-weight, stage-wise co-design of convolution layers and MLPs for visual recognition.
We show that ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters.
arXiv Detail & Related papers (2021-09-09T17:52:57Z)
- Hire-MLP: Vision MLP via Hierarchical Rearrangement [58.33383667626998]
Hire-MLP is a simple yet competitive vision MLP architecture via hierarchical rearrangement.
The proposed Hire-MLP architecture is built with simple channel-mixing operations, thus enjoys high flexibility and inference speed.
Experiments show that our Hire-MLP achieves state-of-the-art performance on the ImageNet-1K benchmark.
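A hedged sketch of one rearrangement step, assuming the inner-region variant along the height axis: rows inside each region are folded into the channel dimension so an ordinary channel-mixing matrix mixes tokens within the region, and the fold is then undone. The region size, single-axis treatment, and names are illustrative simplifications.
```python
import numpy as np

def hire_inner_region(x, r, w_mix):
    """Inner-region rearrangement along height: fold each group of r rows
    into the channel dimension, apply one channel-mixing matrix, and
    unfold, so a plain channel MLP mixes tokens inside every region.

    x: (H, W, C) with H divisible by r; w_mix: (r*C, r*C)."""
    h, w, c = x.shape
    y = x.reshape(h // r, r, w, c).transpose(0, 2, 1, 3).reshape(h // r, w, r * c)
    y = y @ w_mix  # channel mixing now sees r rows at once
    return y.reshape(h // r, w, r, c).transpose(0, 2, 1, 3).reshape(h, w, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
print(hire_inner_region(x, 2, rng.standard_normal((8, 8))).shape)  # (8, 8, 4)
```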
arXiv Detail & Related papers (2021-08-30T16:11:04Z)
- Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure termed the Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet-1K.
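The circulant structure can be sketched directly: if each channel's N x N token-mixing matrix is circulant, the mixing is a circular convolution (hence spatially invariant) with only N parameters per channel, which is where the parameter saving comes from. A hedged NumPy sketch follows, using the FFT purely as a fast circulant product; names are illustrative.
```python
import numpy as np

def circulant_mix(x, kernels):
    """Circulant channel-specific token mixing: each channel gets its own
    length-N kernel, and the mixing matrix is that kernel's circulant
    rotations, i.e. a circular convolution with N parameters per channel
    instead of N^2.

    x: (N, C) tokens; kernels: (C, N)."""
    n, c = x.shape
    out = np.empty_like(x)
    for ch in range(c):
        # circulant matrix-vector product == circular convolution, via FFT
        out[:, ch] = np.fft.ifft(np.fft.fft(x[:, ch]) * np.fft.fft(kernels[ch])).real
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
print(circulant_mix(x, rng.standard_normal((4, 16))).shape)  # (16, 4)
```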
arXiv Detail & Related papers (2021-06-28T17:59:57Z)
- S$^2$-MLP: Spatial-Shift MLP Architecture for Vision [34.47616917228978]
Recently, the visual Transformer (ViT) and its follow-up works abandon convolution and exploit the self-attention operation.
In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP).
arXiv Detail & Related papers (2021-06-14T15:05:11Z)
- MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
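MLP-Mixer's block alternates a token-mixing MLP (acting across patches, per channel) with a channel-mixing MLP (acting across channels, per patch). Below is a minimal NumPy sketch of one block, omitting LayerNorm for brevity; weight shapes and names are illustrative.
```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mixer_block(x, w_tok, w_ch):
    """One Mixer-style block: a token-mixing MLP over the patch axis, then
    a channel-mixing MLP over the feature axis, each with a residual.

    x: (P, C) patches-by-channels; w_tok: ((P', P), (P, P')); w_ch: ((C, C'), (C', C)).
    """
    w1, w2 = w_tok
    x = x + (w2 @ gelu(w1 @ x))  # mix across patches, per channel
    u1, u2 = w_ch
    x = x + (gelu(x @ u1) @ u2)  # mix across channels, per patch
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))  # 16 patches, 8 channels
w_tok = (0.1 * rng.standard_normal((32, 16)), 0.1 * rng.standard_normal((16, 32)))
w_ch = (0.1 * rng.standard_normal((8, 32)), 0.1 * rng.standard_normal((32, 8)))
print(mixer_block(x, w_tok, w_ch).shape)  # (16, 8)
```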
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.