MLP Can Be A Good Transformer Learner
- URL: http://arxiv.org/abs/2404.05657v1
- Date: Mon, 8 Apr 2024 16:40:15 GMT
- Title: MLP Can Be A Good Transformer Learner
- Authors: Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang
- Abstract summary: The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
- Score: 73.01739251050076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands. Previous token-pruning works motivate their methods from the view of computational redundancy but still need to load the full network and incur the same memory costs. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers, guided by entropy considerations. We observe that, for the attention layers in the bottom blocks, their subsequent MLP layers, i.e. two feed-forward layers, can elicit the same entropy quantity. Meanwhile, these accompanying MLPs are under-exploited, since they exhibit smaller feature entropy than the MLPs in the top blocks. Therefore, we propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into identity mappings, leaving only the MLP in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and memory efficiency without compromising performance. Code is available at https://github.com/sihaoevery/lambda_vit.
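To make the mechanism concrete, below is a minimal PyTorch sketch of the idea, not the released lambda_vit code: estimate how much feature entropy a bottom block's attention layer contributes and, when the contribution is negligible, degenerate that attention into an identity mapping so the block reduces to its MLP. The Gaussian entropy proxy, the `entropy_threshold` margin, and the layer sizes are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Simplified pre-norm ViT block whose attention branch can be disabled."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                  # the two feed-forward layers
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.use_attn = True  # when False, the attention layer degenerates to identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_attn:
            y = self.norm1(x)
            x = x + self.attn(y, y, y, need_weights=False)[0]
        # if use_attn is False, the residual attention branch is skipped entirely,
        # so the block reduces to LayerNorm + MLP
        return x + self.mlp(self.norm2(x))


def feature_entropy(feat: torch.Tensor) -> float:
    """Crude entropy proxy: differential entropy of a per-channel Gaussian fit,
    averaged over channels (an illustrative choice, not the paper's measure)."""
    var = feat.detach().float().flatten(0, -2).var(dim=0) + 1e-6
    return (0.5 * torch.log(2 * torch.pi * torch.e * var)).mean().item()


# Toy usage: disable attention in blocks whose attention layer barely raises
# the feature entropy of a (random, stand-in) batch of token features.
blocks = nn.ModuleList([Block(dim=192, num_heads=3) for _ in range(4)])
tokens = torch.randn(2, 197, 192)   # (batch, tokens, dim), DeiT-Tiny-like shapes
entropy_threshold = 0.05            # hypothetical margin, chosen for illustration

x = tokens
for blk in blocks:
    before = feature_entropy(x)
    with torch.no_grad():
        y = blk.norm1(x)
        after_attn = x + blk.attn(y, y, y, need_weights=False)[0]
    if feature_entropy(after_attn) - before < entropy_threshold:
        blk.use_attn = False        # degenerate this attention layer into identity
    x = blk(x)

print([blk.use_attn for blk in blocks])
```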
Related papers
- What Matters in Transformers? Not All Attention is Needed [7.857824255138334]
Transformer-based large language models (LLMs) have demonstrated promising performance across various tasks.
However, their architectures contain redundancy, posing efficiency challenges for real-world deployment.
We investigate redundancy across different modules within Transformers, including blocks and attention layers, using a similarity-based metric.
arXiv Detail & Related papers (2024-06-22T08:41:48Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- giMLPs: Gate with Inhibition Mechanism in MLPs [13.288519661160898]
The gate-with-inhibition MLP (giMLP) can produce comparable performance on the ImageNet classification task.
Gate with inhibition can also achieve appealing results on most NLU tasks without any additional pretraining.
Experiments on ImageNet and twelve downstream language tasks demonstrate the effectiveness of gate with inhibition.
arXiv Detail & Related papers (2022-08-01T15:23:51Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality [113.1414517605892]
We propose a methodology, Locality Injection, to incorporate local priors into an FC layer.
RepMLPNet is the first MLP model that seamlessly transfers to Cityscapes semantic segmentation.
arXiv Detail & Related papers (2021-12-21T10:28:17Z)
- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns (see the sketch after this list).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
- Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters at comparable computational cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
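For the axial token-mixing design summarized in the sMLPNet entry above, the following is a minimal sketch under stated assumptions, not the sMLPNet code: 1D linear layers mix tokens along height and width, with one weight matrix shared across all rows and one across all columns, and an identity branch is fused with the two axial branches. The grid size and channel width are hypothetical.

```python
import torch
import torch.nn as nn


class AxialSparseMLP(nn.Module):
    """Token mixing: identity branch plus 1D mixing along height and width."""

    def __init__(self, h: int, w: int, dim: int):
        super().__init__()
        self.mix_w = nn.Linear(w, w)         # one 1D MLP shared by all rows
        self.mix_h = nn.Linear(h, h)         # one 1D MLP shared by all columns
        self.fuse = nn.Linear(3 * dim, dim)  # fuse identity + two axial branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, h, w, dim) grid of image tokens
        x_w = self.mix_w(x.transpose(-1, -2)).transpose(-1, -2)      # mix along width
        x_h = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix along height
        return self.fuse(torch.cat([x, x_w, x_h], dim=-1))


tokens = torch.randn(2, 14, 14, 96)              # hypothetical 14x14 token grid
print(AxialSparseMLP(14, 14, 96)(tokens).shape)  # torch.Size([2, 14, 14, 96])
```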
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.