Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP
- URL: http://arxiv.org/abs/2207.07284v1
- Date: Fri, 15 Jul 2022 04:18:06 GMT
- Title: Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP
- Authors: Zhicai Wang, Yanbin Hao, Xingyu Gao, Hao Zhang, Shuo Wang, Tingting
Mu, Xiangnan He
- Abstract summary: Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
- Score: 52.25478388220691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision multi-layer perceptrons (MLPs) have shown promising performance in
computer vision tasks, and have become a strong competitor to CNNs and vision
Transformers. They use token-mixing layers to capture cross-token interactions,
as opposed to the multi-head self-attention mechanism used by Transformers.
However, the heavily parameterized token-mixing layers naturally lack
mechanisms to capture local information and multi-granular non-local relations,
which restrains their discriminative power. To tackle this issue, we propose
a new positional spatial gating unit (PoSGU). It exploits the attention
formulations used in the classical relative positional encoding (RPE), to
efficiently encode the cross-token relations for token mixing. This reduces the
quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ or $O(1)$. We
experiment with two RPE mechanisms, and
further propose a group-wise extension that improves their expressive power by
capturing multi-granular contexts. These then serve as the key
building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate
the effectiveness of the proposed approach by conducting thorough experiments,
demonstrating an improved or comparable performance with reduced parameter
complexity. For instance, for a model trained on ImageNet1K, we achieve a
performance improvement from 72.14% to 74.02% and a learnable parameter
reduction from 19.4M to 18.2M. Code can be found at
https://github.com/Zhicaiwww/PosMLP.
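The abstract's key idea, replacing the freely learned N x N token-mixing matrix of a spatial gating unit with weights derived from relative positions, can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical rendering of the O(N) case (one learnable scalar per relative offset inside a local window); the class and attribute names below are illustrative assumptions and do not reproduce the authors' PoSGU implementation in the linked repository.

```python
# Hedged sketch of a relative-position-parameterized spatial gating unit,
# in the spirit of the PoSGU described above. Names such as
# RelPosSpatialGatingUnit and rel_bias_table are illustrative assumptions,
# not the released PosMLP code.
import torch
import torch.nn as nn


class RelPosSpatialGatingUnit(nn.Module):
    """Token mixing via a gating branch whose N x N mixing matrix is built
    from a relative-position bias table (O(N) parameters) instead of being
    a dense, freely learned N x N matrix (O(N^2) parameters)."""

    def __init__(self, dim, window_size):
        super().__init__()
        # dim is assumed even; it is split into two halves below.
        H, W = window_size
        num_rel = (2 * H - 1) * (2 * W - 1)
        # One learnable scalar per distinct relative offset in the window.
        self.rel_bias_table = nn.Parameter(torch.zeros(num_rel))
        self.norm = nn.LayerNorm(dim // 2)

        # Precompute, for every (query, key) token pair in the window,
        # the index of its relative offset in the bias table.
        coords = torch.stack(torch.meshgrid(
            torch.arange(H), torch.arange(W), indexing="ij"))   # (2, H, W)
        coords = coords.flatten(1)                               # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]            # (2, N, N)
        rel = rel.permute(1, 2, 0).contiguous()                  # (N, N, 2)
        rel[:, :, 0] += H - 1
        rel[:, :, 1] += W - 1
        rel[:, :, 0] *= 2 * W - 1
        self.register_buffer("rel_index", rel.sum(-1))           # (N, N)

    def forward(self, x):
        # x: (B, N, C) tokens of one window; split channels into two halves.
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        # Build the token-mixing matrix from the relative-position table.
        N = self.rel_index.shape[0]
        mix = self.rel_bias_table[self.rel_index.view(-1)].view(N, N)
        # Gate: one branch modulates the spatially mixed other branch.
        return u * torch.einsum("mn,bnc->bmc", mix, v)


# Example usage with hypothetical shapes: a 7x7 window gives N = 49 tokens.
sgu = RelPosSpatialGatingUnit(dim=128, window_size=(7, 7))
tokens = torch.randn(2, 49, 128)   # (batch, tokens, channels)
gated = sgu(tokens)                # -> (2, 49, 64)
```

Whether the mixing weights are additionally normalized, how the second (O(1)) RPE mechanism generates them, and how the group-wise extension partitions channels are details best taken from the paper and the released code rather than from this sketch.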
Related papers
- PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition [37.62114379192619]
PosMLP-Video is a lightweight yet powerful MLP-like backbone for video recognition.
PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy.
arXiv Detail & Related papers (2024-07-03T09:07:14Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- Strip-MLP: Efficient Token Interaction for Vision MLP [31.02197585697145]
We introduce Strip-MLP to enrich the token interaction power in three ways.
Strip-MLP significantly improves the performance of spatial-based models on small datasets.
Models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100.
arXiv Detail & Related papers (2023-07-21T09:40:42Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- UNeXt: MLP-based Rapid Medical Image Segmentation Network [80.16644725886968]
UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years.
We propose UNeXt, a convolutional multilayer perceptron (MLP) based network for image segmentation.
We show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance.
arXiv Detail & Related papers (2022-03-09T18:58:22Z)
- Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
arXiv Detail & Related papers (2022-02-14T06:53:48Z)
- S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision [34.47616917228978]
MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition.
In this paper, we improve the S$^2$-MLP vision backbone.
Our medium-scale model, S$^2$-MLPv2-Medium, achieves 83.6% top-1 accuracy on the ImageNet-1K benchmark.
arXiv Detail & Related papers (2021-08-02T17:59:02Z)
- Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure termed Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
arXiv Detail & Related papers (2021-06-28T17:59:57Z)