Are Transformers with One Layer Self-Attention Using Low-Rank Weight
Matrices Universal Approximators?
- URL: http://arxiv.org/abs/2307.14023v3
- Date: Mon, 29 Jan 2024 10:16:41 GMT
- Title: Are Transformers with One Layer Self-Attention Using Low-Rank Weight
Matrices Universal Approximators?
- Authors: Tokio Kajitsuka and Issei Sato
- Abstract summary: We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
One-layer and single-head Transformers have a memorization capacity for finite samples, and Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
- Score: 37.820617032391404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing analyses of the expressive capacity of Transformer models have
required excessively deep layers for data memorization, leading to a
discrepancy with the Transformers actually used in practice. This is primarily
due to the interpretation of the softmax function as an approximation of the
hardmax function. By clarifying the connection between the softmax function and
the Boltzmann operator, we prove that a single layer of self-attention with
low-rank weight matrices possesses the capability to perfectly capture the
context of an entire input sequence. As a consequence, we show that one-layer
and single-head Transformers have a memorization capacity for finite samples,
and that Transformers consisting of one self-attention layer with two
feed-forward neural networks are universal approximators for continuous
permutation equivariant functions on a compact domain.
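For concreteness, here is a minimal numpy sketch of the architecture class this result concerns: a single-head softmax self-attention layer whose query-key product $W_Q W_K^\top$ has rank one, combined with two token-wise feed-forward networks. The row-wise softmax weights play the role of the Boltzmann operator $\mathrm{boltz}_\beta(z) = \sum_i z_i e^{\beta z_i} / \sum_j e^{\beta z_j}$ mentioned in the abstract. The layer ordering, skip connections, and parameter names below are illustrative assumptions, not the paper's explicit construction.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(X, u, v, Wv):
    """Single-head softmax self-attention whose query-key product
    W_Q W_K^T = u v^T has rank one.  X: (n, d) token embeddings,
    u, v: (d,), Wv: (d, d)."""
    logits = np.outer(X @ u, X @ v)   # (n, n) rank-one bilinear scores
    A = softmax(logits, axis=-1)      # row-wise softmax attention weights
    return A @ (X @ Wv)               # each token mixes the whole sequence

def feed_forward(Z, W1, b1, W2, b2):
    """Token-wise two-layer ReLU network, applied to each row of Z."""
    return np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2

def one_layer_transformer(X, params):
    """One self-attention layer with two feed-forward networks
    (skip connections assumed; shapes chosen so residuals add up)."""
    Z = X + feed_forward(X, *params["ffn1"])
    Z = Z + low_rank_attention(Z, *params["attn"])
    return Z + feed_forward(Z, *params["ffn2"])
```

Because the attention logits depend on the entire sequence, even this rank-one layer can produce token representations that distinguish different input contexts, which is the property the memorization and universal-approximation results build on.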
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - Simplicity Bias in Transformers and their Ability to Learn Sparse
Boolean Functions [29.461559919821802]
Recent works have found that Transformers struggle to model several formal languages when compared to recurrent models.
This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models.
arXiv Detail & Related papers (2022-11-22T15:10:48Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions; a generic sketch of RPE-based self-attention appears after this list.
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose a novel scalable Transformer, which naturally contains sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.