Sumformer: Universal Approximation for Efficient Transformers
- URL: http://arxiv.org/abs/2307.02301v1
- Date: Wed, 5 Jul 2023 13:59:35 GMT
- Title: Sumformer: Universal Approximation for Efficient Transformers
- Authors: Silas Alberti, Niclas Dern, Laura Thesing, Gitta Kutyniok
- Abstract summary: We introduce Sumformer, a novel and simple architecture capable of universally approxingimating sequence-to-sequence functions.
We derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
- Score: 2.4832703558223725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing (NLP) made an impressive jump with the
introduction of Transformers. ChatGPT is one of the most famous examples,
changing the perception of the possibilities of AI even outside the research
community. However, besides the impressive performance, the quadratic time and
space complexity of Transformers with respect to sequence length pose
significant limitations for handling long sequences. While efficient
Transformer architectures like Linformer and Performer with linear complexity
have emerged as promising solutions, their theoretical understanding remains
limited. In this paper, we introduce Sumformer, a novel and simple architecture
capable of universally approximating equivariant sequence-to-sequence
functions. We use Sumformer to give the first universal approximation results
for Linformer and Performer. Moreover, we derive a new proof for Transformers,
showing that just one attention layer is sufficient for universal
approximation.
Related papers
- Looped Transformers for Length Generalization [41.99378201613648]
We show that looped Transformers with an adaptive number of steps significantly improve length generalization.
We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
arXiv Detail & Related papers (2024-09-24T01:21:17Z) - Transformers are Expressive, But Are They Expressive Enough for Regression? [38.369337945109855]
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z) - Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z) - What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z) - Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z) - Are Transformers with One Layer Self-Attention Using Low-Rank Weight
Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
One-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
arXiv Detail & Related papers (2023-07-26T08:07:37Z) - Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE)
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.