Rethinking Token-Mixing MLP for MLP-based Vision Backbone
- URL: http://arxiv.org/abs/2106.14882v1
- Date: Mon, 28 Jun 2021 17:59:57 GMT
- Title: Rethinking Token-Mixing MLP for MLP-based Vision Backbone
- Authors: Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
- Abstract summary: We propose an improved structure termed the Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
- Score: 34.47616917228978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the past decade, we have witnessed rapid progress in machine vision
backbones. By introducing inductive biases from image processing, convolutional
neural networks (CNNs) have achieved excellent performance in numerous computer
vision tasks and have been established as the \emph{de facto} backbone. In
recent years, inspired by the great success achieved by Transformer in NLP
tasks, vision Transformer models emerge. Using much less inductive bias, they
have achieved promising performance in computer vision tasks compared with
their CNN counterparts. More recently, researchers have investigated pure-MLP
architectures for the vision backbone to further reduce the inductive bias,
achieving good performance. The pure-MLP backbone is built upon
channel-mixing MLPs to fuse the channels and token-mixing MLPs for
communications between patches. In this paper, we re-think the design of the
token-mixing MLP. We discover that token-mixing MLPs in existing MLP-based
backbones are spatial-specific and thus sensitive to spatial translation.
Meanwhile, the channel-agnostic property of existing token-mixing MLPs limits
their capability in mixing tokens. To overcome these limitations, we propose an
improved structure termed the Circulant
Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and
channel-specific. It takes fewer parameters but achieves higher classification
accuracy on ImageNet1K benchmark.
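The abstract's two properties can be illustrated concretely. A spatial-invariant, channel-specific token mixer can be built from one circulant matrix per channel: each channel mixes the N tokens with cyclic shifts of its own length-N weight vector, so it needs N parameters per channel (vs. N^2 for a dense spatial-specific mixer) and commutes with cyclic token shifts. The sketch below is a minimal NumPy illustration of this idea only; the function names and the plain-loop implementation are our own, not the paper's code, and the actual CCS layer may differ in details (e.g. FFT-based evaluation, normalization, nonlinearities).

```python
import numpy as np

def circulant(w):
    """Build an N x N circulant matrix whose i-th row is w cyclically shifted by i."""
    n = len(w)
    return np.stack([np.roll(w, i) for i in range(n)])

def ccs_token_mix(x, w):
    """Circulant Channel-Specific token mixing (illustrative sketch).

    x: (N, C) array of N patch tokens with C channels.
    w: (C, N) array, one length-N circulant generator per channel.

    Each channel c mixes tokens with its own circulant matrix, so the
    operation is channel-specific (unlike a single shared token-mixing
    matrix) yet spatial-invariant: cyclically shifting the input tokens
    cyclically shifts the output by the same amount.
    """
    n, c = x.shape
    out = np.empty((n, c))
    for ch in range(c):
        out[:, ch] = circulant(w[ch]) @ x[:, ch]
    return out
```

As a sanity check on the spatial-invariance claim, `ccs_token_mix(np.roll(x, s, axis=0), w)` equals `np.roll(ccs_token_mix(x, w), s, axis=0)` for any shift `s`, which a dense spatial-specific token-mixing MLP does not satisfy.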
Related papers
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning [40.994306592119266]
Fine-tuning a pre-trained language model (PLM) emerges as the predominant strategy in many natural language processing applications.
Some general approaches (e.g. quantization and distillation) have been widely studied to reduce the compute/memory of PLM fine-tuning.
We propose to coin a lightweight PLM through NTK-approximating modules in fusion.
arXiv Detail & Related papers (2023-07-18T03:12:51Z) - Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing [123.43419144051703]
We present a novel MLP-like 3D architecture for video recognition.
The results are comparable to state-of-the-art widely-used 3D CNNs and video transformers.
arXiv Detail & Related papers (2022-06-13T16:21:33Z) - Mixing and Shifting: Exploiting Global and Local Dependencies in Vision
MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
arXiv Detail & Related papers (2022-02-14T06:53:48Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision? [0.0]
CNN has reigned supreme in the world of computer vision for the past ten years, but recently, Transformer is on the rise.
In particular, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias.
The proposed model, named RaftMLP, has a good balance of computational complexity, the number of parameters, and actual memory usage.
arXiv Detail & Related papers (2021-08-09T23:55:24Z) - CycleMLP: A MLP-like Architecture for Dense Prediction [26.74203747156439]
CycleMLP is a versatile backbone for visual recognition and dense predictions.
It can cope with various image sizes and achieves computational complexity linear in image size by using local windows.
CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP-like models.
arXiv Detail & Related papers (2021-07-21T17:23:06Z) - S$^2$-MLP: Spatial-Shift MLP Architecture for Vision [34.47616917228978]
Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation.
In this paper, we propose a novel pure-MLP architecture, spatial-shift MLP (S$^2$-MLP).
arXiv Detail & Related papers (2021-06-14T15:05:11Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.