Attention-Only Transformers and Implementing MLPs with Attention Heads
- URL: http://arxiv.org/abs/2309.08593v1
- Date: Fri, 15 Sep 2023 17:47:45 GMT
- Title: Attention-Only Transformers and Implementing MLPs with Attention Heads
- Authors: Robert Huben and Valerie Morris
- Abstract summary: We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1.
We also prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer architecture is widely used in machine learning models and
consists of two alternating sublayers: attention heads and MLPs. We prove that
an MLP neuron can be implemented by a masked attention head with internal
dimension 1 so long as the MLP's activation function comes from a restricted
class including SiLU and close approximations of ReLU and GeLU. This allows one
to convert an MLP-and-attention transformer into an attention-only transformer
at the cost of greatly increasing the number of attention heads. We also prove
that attention heads can perform the components of an MLP (linear
transformations and activation functions) separately. Finally, we prove that
attention heads can encode arbitrary masking patterns in their weight matrices
to within arbitrarily small error.
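The neuron construction can be illustrated numerically: softmax over the two logits [x, 0] places weight sigmoid(x) on a value equal to x, so a two-position attention pattern with value dimension 1 reproduces SiLU(x) = x * sigmoid(x), and a large negative logit drives a position's softmax weight toward zero, approximating a mask. The NumPy sketch below is a minimal illustration of these two identities only; the function names are ours and it is not the paper's exact weight construction.
```python
import numpy as np

def silu(x):
    """Reference SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def attention_head_neuron(x, other_logit=0.0):
    """One attention head with internal (value) dimension 1 over two positions.

    The current position contributes logit x and value x; a reference
    position contributes logit `other_logit` (0 by default) and value 0.
    softmax([x, 0])[0] = sigmoid(x), so the weighted value equals SiLU(x).
    """
    logits = np.array([x, other_logit])
    values = np.array([x, 0.0])
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return float(weights @ values)

# The head reproduces SiLU exactly when the reference logit is 0.
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert abs(attention_head_neuron(x) - silu(x)) < 1e-9

# Approximate masking: a very negative logit sends that position's
# softmax weight toward zero, so the masked value is effectively ignored.
print(attention_head_neuron(1.0, other_logit=-1e4))  # ~1.0, i.e. weight ~1 on value x
```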
Related papers
- Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion [20.437172251393257]
Inspired by the lateralization of the human brain, we propose a new simple but effective architecture called the Lateralization MLP (L-MLP).
arXiv Detail & Related papers (2024-05-25T07:10:02Z) - MLP Can Be A Good Transformer Learner [73.01739251050076]
Self-attention mechanism is the key of the Transformer but often criticized for its computation demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - MLP Architectures for Vision-and-Language Modeling: An Empirical Study [91.6393550858739]
We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion.
We find that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers.
Instead of heavy multi-head attention, adding tiny one-head attention to MLP encoders is sufficient to achieve comparable performance to transformers.
arXiv Detail & Related papers (2021-12-08T18:26:19Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, and the parameters are shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)