Strip-MLP: Efficient Token Interaction for Vision MLP
- URL: http://arxiv.org/abs/2307.11458v1
- Date: Fri, 21 Jul 2023 09:40:42 GMT
- Title: Strip-MLP: Efficient Token Interaction for Vision MLP
- Authors: Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang,
Yaowei Wang, Jianguo Zhang
- Abstract summary: We introduce Strip-MLP to enrich the token interaction power in three ways.
Strip-MLP significantly improves the performance of spatial-based models on small datasets.
Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100.
- Score: 31.02197585697145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Token interaction operation is one of the core modules in MLP-based models to
exchange and aggregate information between different spatial locations.
However, the power of token interaction on the spatial dimension is highly
dependent on the spatial resolution of the feature maps, which limits the
model's expressive ability, especially in deep layers where the features are
down-sampled to a small spatial size. To address this issue, we present a novel
method called \textbf{Strip-MLP} to enrich the token interaction power in three
ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that
allows the token to interact with other tokens in a cross-strip manner,
enabling the tokens in a row (or column) to contribute to the information
aggregations in adjacent but different strips of rows (or columns). Secondly, a
\textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule
(CGSMM) is proposed to overcome the performance degradation caused by small
spatial feature size. The module allows tokens to interact more effectively in
both within-patch and cross-patch manners, which is independent of the
feature spatial size. Finally, based on the Strip MLP layer, we propose a novel
\textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost
the token interaction power in the local region. Extensive experiments
demonstrate that Strip-MLP significantly improves the performance of MLP-based
models on small datasets and obtains comparable or even better results on
ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy
than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on
CIFAR-100. The source codes will be available
at~\href{https://github.com/Med-Process/Strip_MLP}{https://github.com/Med-Process/Strip\_MLP}.
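As a rough illustration of the cross-strip idea described above (a sketch under assumptions, not the authors' implementation): the module below folds a small strip of adjacent rows into the channel dimension so that a single token-mixing linear layer along the width lets every row contribute to the aggregation of neighbouring rows in the same strip. The class name CrossStripMixing, the strip_width parameter, the (B, H, W, C) layout, and the restriction to one axis are all assumptions made for illustration; the paper's CGSMM and LSMM add patch-wise grouping and local mixing that this sketch does not attempt to reproduce.

```python
import torch
import torch.nn as nn


class CrossStripMixing(nn.Module):
    """Illustrative cross-strip token mixing (sketch only, not the paper's code).

    A strip of `strip_width` adjacent rows is folded into the channel axis, so
    a single linear layer along the width lets every row of the strip
    contribute to the aggregation of its neighbouring rows.
    """

    def __init__(self, dim, width, strip_width=2):
        super().__init__()
        self.strip_width = strip_width
        # Channel mixing applied jointly across the rows of one strip.
        self.strip_proj = nn.Linear(dim * strip_width, dim * strip_width)
        # Token mixing along the width axis.
        self.token_mix = nn.Linear(width, width)

    def forward(self, x):                          # x: (B, H, W, C)
        B, H, W, C = x.shape
        s = self.strip_width
        assert H % s == 0, "H must be divisible by strip_width"
        # Group s adjacent rows into one strip and fold them into channels.
        x = x.reshape(B, H // s, s, W, C).permute(0, 1, 3, 2, 4)   # (B, H/s, W, s, C)
        x = x.reshape(B, H // s, W, s * C)
        x = self.strip_proj(x)                                     # within-strip mixing
        x = self.token_mix(x.transpose(-1, -2)).transpose(-1, -2)  # along-width mixing
        # Unfold the strips back to the original layout.
        x = x.reshape(B, H // s, W, s, C).permute(0, 1, 3, 2, 4)
        return x.reshape(B, H, W, C)


# Example: a 14x14 feature map with 64 channels keeps its shape.
# y = CrossStripMixing(dim=64, width=14, strip_width=2)(torch.randn(2, 14, 14, 64))
```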
Related papers
- TriMLP: Revenge of a MLP-like Architecture in Sequential Recommendation [23.32537260687907]
We present an MLP-like architecture for sequential recommendation, namely TriMLP, with a novel Triangular Mixer for cross-token communications.
In designing Triangular Mixer, we simplify the cross-token operation in MLP as the basic matrix multiplication, and drop the lower-triangle neurons of the weight matrix to block the anti-chronological order connections from future tokens.
arXiv Detail & Related papers (2023-05-24T03:32:31Z)
- BiMLP: Compact Binary Architectures for Vision Multi-Layer Perceptrons [37.28828605119602]
This paper studies the problem of designing compact binary architectures for vision multi-layer perceptrons (MLPs)
We find that previous binarization methods perform poorly due to limited capacity of binary samplings.
We propose to improve the performance of the binary mixing and channel mixing (BiMLP) model by enriching the representation ability of binary FC layers.
arXiv Detail & Related papers (2022-12-29T02:43:41Z)
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
- UNeXt: MLP-based Rapid Medical Image Segmentation Network [80.16644725886968]
UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years.
We propose UNeXt, a convolutional multilayer perceptron (MLP) based network for image segmentation.
We show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance.
arXiv Detail & Related papers (2022-03-09T18:58:22Z)
- Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP (MS-MLP), which makes the size of the local receptive field used for mixing increase with the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
arXiv Detail & Related papers (2022-02-14T06:53:48Z)
- RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality [113.1414517605892]
We propose a methodology, Locality Injection, to incorporate local priors into an FC layer.
RepMLPNet is the first MLP model that seamlessly transfers to Cityscapes semantic segmentation.
arXiv Detail & Related papers (2021-12-21T10:28:17Z)
- Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has been proved an effective architecture for scaling attention-based models to more parameters with comparable computation cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z)
- Hire-MLP: Vision MLP via Hierarchical Rearrangement [58.33383667626998]
Hire-MLP is a simple yet competitive vision MLP architecture via hierarchical rearrangement.
The proposed Hire-MLP architecture is built with simple channel-mixing operations, thus enjoys high flexibility and inference speed.
Experiments show that our Hire-MLP achieves state-of-the-art performance on the ImageNet-1K benchmark.
arXiv Detail & Related papers (2021-08-30T16:11:04Z)
- CycleMLP: A MLP-like Architecture for Dense Prediction [26.74203747156439]
CycleMLP is a versatile backbone for visual recognition and dense predictions.
It can cope with various image sizes and achieves linear computational complexity to image size by using local windows.
CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models.
arXiv Detail & Related papers (2021-07-21T17:23:06Z)
- AS-MLP: An Axial Shifted MLP Architecture for Vision [50.11765148947432]
An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper.
By axially shifting channels of the feature map, AS-MLP is able to obtain the information flow from different directions.
With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset.
arXiv Detail & Related papers (2021-07-18T08:56:34Z)
- S$^2$-MLP: Spatial-Shift MLP Architecture for Vision [34.47616917228978]
Recently, the visual Transformer (ViT) and its follow-up works abandon the convolution and exploit the self-attention operation.
In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP).
arXiv Detail & Related papers (2021-06-14T15:05:11Z)
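Several of the papers above (S$^2$-MLP, AS-MLP, MS-MLP) rely on shifting groups of channels by one pixel so that a purely per-token MLP can read features of neighbouring tokens. The snippet below is a minimal sketch of that spatial-shift idea, assuming a (B, H, W, C) layout, four channel groups, one-pixel shifts, and zero padding at the borders; the exact grouping, shift pattern, and block design differ across those papers.

```python
import torch
import torch.nn as nn


def spatial_shift(x):
    """Shift four channel groups by one pixel in four directions (zero padded).

    x: tensor of shape (B, H, W, C). A simplified take on the shift operation.
    """
    B, H, W, C = x.shape
    g = C // 4
    out = torch.zeros_like(x)
    out[:, 1:, :, 0:g]     = x[:, :-1, :, 0:g]      # shift down along rows
    out[:, :-1, :, g:2*g]  = x[:, 1:, :, g:2*g]     # shift up
    out[:, :, 1:, 2*g:3*g] = x[:, :, :-1, 2*g:3*g]  # shift right along columns
    out[:, :, :-1, 3*g:]   = x[:, :, 1:, 3*g:]      # shift left
    return out


class ShiftMLPBlock(nn.Module):
    """Shift, then apply a per-token MLP: after the shift, each token's channel
    vector already carries features of its four neighbours, so plain channel
    mixing performs the spatial communication."""

    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):                           # x: (B, H, W, C)
        return self.fc2(self.act(spatial_shift(self.fc1(x))))
```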