Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP
- URL: http://arxiv.org/abs/2207.07284v1
- Date: Fri, 15 Jul 2022 04:18:06 GMT
- Title: Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP
- Authors: Zhicai Wang, Yanbin Hao, Xingyu Gao, Hao Zhang, Shuo Wang, Tingting
Mu, Xiangnan He
- Abstract summary: Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
- Score: 52.25478388220691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision multi-layer perceptrons (MLPs) have shown promising performance in
computer vision tasks, and have become a strong competitor to CNNs and vision
Transformers. They use token-mixing layers to capture cross-token interactions,
as opposed to the multi-head self-attention mechanism used by Transformers.
However, the heavily parameterized token-mixing layers naturally lack
mechanisms to capture local information and multi-granular non-local relations,
which restrains their discriminative power. To tackle this issue, we propose
a new positional spatial gating unit (PoSGU). It exploits the attention
formulations used in the classical relative positional encoding (RPE), to
efficiently encode the cross-token relations for token mixing. This reduces the
quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ or $O(1)$. We
experiment with two RPE mechanisms, and
further propose a group-wise extension that improves their expressive power by
capturing multi-granular contexts. These then serve as the key
building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate
the effectiveness of the proposed approach by conducting thorough experiments,
demonstrating an improved or comparable performance with reduced parameter
complexity. For instance, for a model trained on ImageNet1K, we achieve a
performance improvement from 72.14% to 74.02% and a learnable parameter
reduction from 19.4M to 18.2M. Code can be found at
https://github.com/Zhicaiwww/PosMLP.
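The abstract's key idea, replacing the freely learned N x N token-mixing matrix of a spatial gating unit with weights derived from relative positions, can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical rendering of the O(N) case (one learnable scalar per relative offset inside a local window); the class and attribute names below are illustrative assumptions and do not reproduce the authors' PoSGU implementation in the linked repository.

```python
# Hedged sketch of a relative-position-parameterized spatial gating unit,
# in the spirit of the PoSGU described above. Names such as
# RelPosSpatialGatingUnit and rel_bias_table are illustrative assumptions,
# not the released PosMLP code.
import torch
import torch.nn as nn


class RelPosSpatialGatingUnit(nn.Module):
    """Token mixing via a gating branch whose N x N mixing matrix is built
    from a relative-position bias table (O(N) parameters) instead of being
    a dense, freely learned N x N matrix (O(N^2) parameters)."""

    def __init__(self, dim, window_size):
        super().__init__()
        # dim is assumed even; it is split into two halves below.
        H, W = window_size
        num_rel = (2 * H - 1) * (2 * W - 1)
        # One learnable scalar per distinct relative offset in the window.
        self.rel_bias_table = nn.Parameter(torch.zeros(num_rel))
        self.norm = nn.LayerNorm(dim // 2)

        # Precompute, for every (query, key) token pair in the window,
        # the index of its relative offset in the bias table.
        coords = torch.stack(torch.meshgrid(
            torch.arange(H), torch.arange(W), indexing="ij"))   # (2, H, W)
        coords = coords.flatten(1)                               # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]            # (2, N, N)
        rel = rel.permute(1, 2, 0).contiguous()                  # (N, N, 2)
        rel[:, :, 0] += H - 1
        rel[:, :, 1] += W - 1
        rel[:, :, 0] *= 2 * W - 1
        self.register_buffer("rel_index", rel.sum(-1))           # (N, N)

    def forward(self, x):
        # x: (B, N, C) tokens of one window; split channels into two halves.
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        # Build the token-mixing matrix from the relative-position table.
        N = self.rel_index.shape[0]
        mix = self.rel_bias_table[self.rel_index.view(-1)].view(N, N)
        # Gate: one branch modulates the spatially mixed other branch.
        return u * torch.einsum("mn,bnc->bmc", mix, v)


# Example usage with hypothetical shapes: a 7x7 window gives N = 49 tokens.
sgu = RelPosSpatialGatingUnit(dim=128, window_size=(7, 7))
tokens = torch.randn(2, 49, 128)   # (batch, tokens, channels)
gated = sgu(tokens)                # -> (2, 49, 64)
```

Whether the mixing weights are additionally normalized, how the second (O(1)) RPE mechanism generates them, and how the group-wise extension partitions channels are details best taken from the paper and the released code rather than from this sketch.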
Related papers
- PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition [37.62114379192619]
PosMLP-Video is a lightweight yet powerful MLP-like backbone for video recognition.
PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy.
arXiv Detail & Related papers (2024-07-03T09:07:14Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- Strip-MLP: Efficient Token Interaction for Vision MLP [31.02197585697145]
We introduce Strip-MLP to enrich the token interaction power in three ways.
Strip-MLP significantly improves the performance of spatial-based models on small datasets.
Models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100.
arXiv Detail & Related papers (2023-07-21T09:40:42Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- UNeXt: MLP-based Rapid Medical Image Segmentation Network [80.16644725886968]
UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years.
We propose UNeXt, a convolutional multilayer perceptron (MLP) based network for image segmentation.
We show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance.
arXiv Detail & Related papers (2022-03-09T18:58:22Z)
- Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
arXiv Detail & Related papers (2022-02-14T06:53:48Z)
- S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision [34.47616917228978]
MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition.
In this paper, we improve the S$^2$-MLP vision backbone.
Our medium-scale model, S$^2$-MLPv2-Medium, achieves 83.6% top-1 accuracy on the ImageNet-1K benchmark.
arXiv Detail & Related papers (2021-08-02T17:59:02Z)
- Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure termed Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
arXiv Detail & Related papers (2021-06-28T17:59:57Z)