Pay Attention to MLPs
- URL: http://arxiv.org/abs/2105.08050v1
- Date: Mon, 17 May 2021 17:55:04 GMT
- Title: Pay Attention to MLPs
- Authors: Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
- Abstract summary: We show that gMLP can perform as well as Transformers in key language and applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
- Score: 84.54729425918164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become one of the most important architectural innovations
in deep learning and have enabled many breakthroughs over the past few years.
Here we propose a simple attention-free network architecture, gMLP, based
solely on MLPs with gating, and show that it can perform as well as
Transformers in key language and vision applications. Our comparisons show that
self-attention is not critical for Vision Transformers, as gMLP can achieve the
same accuracy. For BERT, our model achieves parity with Transformers on
pretraining perplexity and is better on some downstream tasks. On finetuning
tasks where gMLP performs worse, making the gMLP model substantially larger can
close the gap with Transformers. In general, our experiments show that gMLP can
scale as well as Transformers over increased data and compute.
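As a concrete illustration of the "MLPs with gating" idea described above, here is a minimal PyTorch sketch of a gMLP-style block built around a Spatial Gating Unit. The layer sizes and initialization constants are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a gMLP block with a Spatial Gating Unit (SGU).
# Shapes and init constants are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Linear projection across the token (spatial) dimension.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Near-zero weights and unit bias so the gate starts close to identity.
        nn.init.normal_(self.spatial_proj.weight, std=1e-6)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):                      # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)              # split channels into two halves
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                           # elementwise gating

class GMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        shortcut = x
        x = self.norm(x)
        x = F.gelu(self.proj_in(x))
        x = self.sgu(x)
        x = self.proj_out(x)
        return shortcut + x                    # residual connection
```

The spatial projection is initialized near the identity (near-zero weights, unit bias) so that early in training the block behaves like a plain MLP before the gating becomes informative.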
Related papers
- Attention-Only Transformers and Implementing MLPs with Attention Heads [0.0]
We prove that a neuron can be implemented by a masked attention head with internal dimension 1.
We also prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
arXiv Detail & Related papers (2023-09-15T17:47:45Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that activation maps in machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
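As a rough illustration of the activation sparsity discussed in the entry above, the toy sketch below counts the fraction of hidden units that a ReLU MLP leaves at exactly zero; the model and shapes are made up for demonstration and are not the paper's experimental setup.

```python
# Toy sketch: measure activation sparsity after the hidden-layer nonlinearity
# of a Transformer-style MLP. Model, shapes, and inputs are illustrative only.
import torch
import torch.nn as nn

d_model, d_ffn = 64, 256
mlp = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))

x = torch.randn(8, 128, d_model)                # (batch, tokens, channels)
hidden = mlp[1](mlp[0](x))                      # activations after the ReLU
sparsity = (hidden == 0).float().mean().item()  # fraction of zero entries
print(f"fraction of inactive hidden units: {sparsity:.2%}")
```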
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Wide Attention Is The Way Forward For Transformers [9.252523881586054]
We show that wide single-layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks.
Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
arXiv Detail & Related papers (2022-10-02T21:49:54Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
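To make the "MoE in the feature dimension" idea from this entry concrete, below is a hedged sketch of a generic top-1 mixture-of-experts feed-forward layer; the routing scheme and expert shapes are generic assumptions, not the paper's sMLP design.

```python
# Illustrative top-1 mixture-of-experts feed-forward layer. Routing and
# expert shapes are generic assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoEFFN(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)  # routing probabilities
        top_p, top_idx = scores.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # scale by the routing probability so routing stays trainable
                out[mask] = expert(x[mask]) * top_p[mask].unsqueeze(-1)
        return out
```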
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- MLP Architectures for Vision-and-Language Modeling: An Empirical Study [91.6393550858739]
We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion.
We find that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers.
Instead of heavy multi-head attention, adding tiny one-head attention to MLP encoders is sufficient to achieve comparable performance to transformers.
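The "tiny one-head attention" finding above can be pictured with a short sketch: a single-head self-attention layer that could be dropped into an otherwise MLP-based encoder. The dimensions are illustrative and this is not the paper's exact model.

```python
# Sketch of a single-head self-attention layer that could augment an
# MLP-based encoder. Shapes are illustrative only.
import torch
import torch.nn as nn

d_model = 64
one_head_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)

tokens = torch.randn(2, 50, d_model)             # (batch, sequence, channels)
attended, _ = one_head_attn(tokens, tokens, tokens)
print(attended.shape)                            # torch.Size([2, 50, 64])
```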
arXiv Detail & Related papers (2021-12-08T18:26:19Z)
- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies a 1D MLP along the axial directions, and the parameters are shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
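A minimal sketch of the axial token mixing described in this entry: 1D linear maps applied along image rows and columns, with parameters shared across rows and columns respectively. The three-branch fusion below is an illustrative guess, not necessarily the paper's exact module.

```python
# Sketch of axial (row/column) token mixing with shared 1D linear maps.
# The concatenate-and-fuse step is an assumption for illustration.
import torch
import torch.nn as nn

class AxialMLPMixing(nn.Module):
    def __init__(self, height: int, width: int, channels: int):
        super().__init__()
        self.mix_w = nn.Linear(width, width)      # shared across all rows
        self.mix_h = nn.Linear(height, height)    # shared across all columns
        self.fuse = nn.Linear(3 * channels, channels)

    def forward(self, x):                         # x: (batch, H, W, C)
        horiz = self.mix_w(x.transpose(2, 3)).transpose(2, 3)           # mix along W
        vert = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)    # mix along H
        return self.fuse(torch.cat([x, horiz, vert], dim=-1))
```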
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
- Regularizing Transformers With Deep Probabilistic Layers [62.997667081978825]
In this work, we demonstrate how the inclusion of deep generative models within BERT can yield more versatile models.
We prove its effectiveness not only in Transformers but also in the most relevant encoder-decoder based LMs: seq2seq with and without attention.
arXiv Detail & Related papers (2021-08-23T10:17:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.