Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input
- URL: http://arxiv.org/abs/2305.18699v1
- Date: Tue, 30 May 2023 02:44:49 GMT
- Title: Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input
- Authors: Shokichi Takakura, Taiji Suzuki
- Abstract summary: We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
- Score: 50.83356836818667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the great success of Transformer networks in various applications
such as natural language processing and computer vision, their theoretical
aspects are not well understood. In this paper, we study the approximation and
estimation ability of Transformers as sequence-to-sequence functions with
infinite dimensional inputs. Although inputs and outputs are both infinite
dimensional, we show that when the target function has anisotropic smoothness,
Transformers can avoid the curse of dimensionality due to their feature
extraction ability and parameter sharing property. In addition, we show that
even if the smoothness changes depending on each input, Transformers can
estimate the importance of features for each input and extract important
features dynamically. We then prove that Transformers achieve a convergence
rate similar to that of the fixed-smoothness case. Our theoretical
results support the practical success of Transformers for high dimensional
data.
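To make the flavor of these results concrete, the following LaTeX sketch states a deliberately simplified, Hölder-type version of anisotropic smoothness and the kind of dimension-free rate such an assumption can yield. The paper's actual function classes for infinite dimensional sequence-to-sequence maps, and its precise rates, differ in the details; treat this as an illustrative assumption, not the paper's statement.

```latex
% Illustrative only -- not the paper's exact definitions or theorem statements.
% Anisotropic (Hoelder-type) smoothness: each coordinate j has its own exponent a_j > 0.
\[
  |f(x) - f(y)| \;\le\; \sum_{j \ge 1} L_j \, |x_j - y_j|^{\min(a_j, 1)} .
\]
% An effective smoothness \tilde{a} aggregates the coordinate-wise exponents:
\[
  \frac{1}{\tilde{a}} \;=\; \sum_{j \ge 1} \frac{1}{a_j} .
\]
% Classical nonparametric theory then suggests an estimation error of order
\[
  n^{-\frac{2\tilde{a}}{2\tilde{a}+1}} \qquad \text{(up to logarithmic factors)},
\]
% which does not blow up with the nominal input dimension as long as the a_j grow
% quickly enough -- the sense in which the curse of dimensionality can be avoided.
```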
Related papers
- Measure-to-measure interpolation using Transformers [6.13239149235581]
Transformers are deep neural network architectures that underpin the recent successes of large language models.
A Transformer acts as a measure-to-measure map, implemented as a specific interacting particle system on the unit sphere (a schematic sketch of this particle-system view follows this entry).
We provide an explicit choice of parameters that allows a single Transformer to match $N$ arbitrary input measures to $N$ arbitrary target measures.
arXiv Detail & Related papers (2024-11-07T09:18:39Z)
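As a rough, self-contained illustration of the interacting-particle view of attention (not the explicit parameter choice constructed in the paper), the sketch below keeps a set of token-particles on the unit sphere and applies repeated attention-driven updates followed by re-projection. The temperature `beta`, the step size, and the identity value map are assumptions made purely for the demo.

```python
import numpy as np

def sphere_project(x):
    """Project each row of x onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_particle_step(X, beta=1.0, step=0.1):
    """One attention-driven interacting-particle update (schematic, not the paper's construction).

    X: (n, d) particles (tokens) on the unit sphere. Each particle moves toward the
    attention-weighted average of all particles, then is re-projected onto the sphere.
    """
    logits = beta * (X @ X.T)                        # pairwise inner products as attention logits
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    drift = weights @ X                              # attention-weighted averages
    return sphere_project(X + step * drift)

rng = np.random.default_rng(0)
X = sphere_project(rng.normal(size=(8, 3)))          # 8 particles on the 2-sphere
for _ in range(50):
    X = attention_particle_step(X)
print(np.round(X @ X.T, 2))                          # pairwise inner products after the dynamics
```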
- Transformers are Expressive, But Are They Expressive Enough for Regression? [38.369337945109855]
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z)
- On the Expressive Power of a Variant of the Looped Transformer [83.30272757948829]
We design a novel transformer block, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
The proposed AlgoFormer can achieve significantly higher expressiveness in algorithm representation when using the same number of parameters.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
- Transformers, parallel computation, and logarithmic depth [33.659870765923884]
We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation.
arXiv Detail & Related papers (2024-02-14T15:54:55Z)
- Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
We also show that one-layer, single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation-equivariant functions on a compact domain (a small numerical check of permutation equivariance follows this entry).
arXiv Detail & Related papers (2023-07-26T08:07:37Z)
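The permutation-equivariance part of this claim is easy to check numerically for a plain single-head self-attention layer without positional encodings: permuting the input rows permutes the output rows identically. The random weights below are placeholders, not the low-rank matrices from the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # model / head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    """Single-head self-attention without positional encodings: (n, d) -> (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

X = rng.normal(size=(6, d))                  # a length-6 input sequence
perm = rng.permutation(6)

# Permutation equivariance: attention(P X) == P attention(X)
lhs = self_attention(X[perm])
rhs = self_attention(X)[perm]
print(np.allclose(lhs, rhs))                 # True
```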
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions for universal approximation (a schematic sketch follows this entry).
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
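The following is a rough sketch of the multiplicative relative-position idea behind URPE-style attention: the attention probabilities are reweighted element-wise by a learnable Toeplitz matrix indexed by relative offset. The exact construction, normalization, and parameterization in the paper may differ, so treat this as an assumed schematic rather than the authors' module.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4                                  # sequence length, head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Learnable relative-position coefficients c[i - j], one per offset in [-(n-1), n-1].
rel_coeff = rng.normal(size=2 * n - 1)

def toeplitz_from_offsets(c, n):
    """Build C with C[i, j] = c[i - j + n - 1] (a Toeplitz matrix over relative positions)."""
    idx = np.arange(n)
    return c[idx[:, None] - idx[None, :] + n - 1]

def urpe_like_attention(X):
    """Attention probabilities multiplied element-wise by a relative-position matrix.

    Schematic only: the actual URPE module may normalize or parameterize differently.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    C = toeplitz_from_offsets(rel_coeff, n)
    return (probs * C) @ V                   # element-wise reweighting, then value mixing

X = rng.normal(size=(n, d))
print(urpe_like_attention(X).shape)          # (6, 4)
```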
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer (a generic group-wise projection sketch follows this entry).
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
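Group-wise transformation, in the generic sense, splits the feature dimension into groups and applies a smaller map to each group, cutting the parameter count roughly by the number of groups. The sketch below shows that generic idea for a linear projection only; it is not a reproduction of LW-Transformer's specific architecture.

```python
import numpy as np

def groupwise_linear(X, weights):
    """Apply a separate small linear map to each feature group and concatenate.

    X: (n, d) inputs with d divisible by the number of groups.
    weights: list of (d/G, d/G) matrices, one per group.
    Parameter count: G * (d/G)^2 = d^2 / G, versus d^2 for a full linear layer.
    """
    groups = np.split(X, len(weights), axis=-1)
    return np.concatenate([g @ W for g, W in zip(groups, weights)], axis=-1)

rng = np.random.default_rng(2)
n, d, G = 5, 8, 4                            # sequence length, width, number of groups
weights = [rng.normal(size=(d // G, d // G)) for _ in range(G)]
X = rng.normal(size=(n, d))

out = groupwise_linear(X, weights)
full_params = d * d
group_params = sum(W.size for W in weights)
print(out.shape, full_params, group_params)  # (5, 8) 64 16
```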
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose a novel Scalable Transformer, which naturally contains sub-Transformers of different scales with shared parameters (a weight-sharing sketch follows this entry).
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
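One simple way to obtain sub-networks of different widths that share parameters is to let each smaller model use a slice of one full-size weight matrix, so training the full model also trains every sub-model. This is a schematic assumption for illustration, not the paper's actual Scalable Transformer scheme or its three-stage training procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
d_full = 8                                       # width of the full model
W_shared = rng.normal(size=(d_full, d_full))     # one parameter matrix shared by all scales

def sub_projection(X, width):
    """Project using the top-left (width x width) slice of the shared weight matrix.

    Every sub-model of a smaller width reuses the same parameters that the
    full model trains; nothing is duplicated.
    """
    return X[:, :width] @ W_shared[:width, :width]

X = rng.normal(size=(5, d_full))
for width in (2, 4, 8):                          # three "sub-Transformer" widths
    print(width, sub_projection(X, width).shape) # prints each width and its output shape
```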
- On the Computational Power of Transformers and its Implications in Sequence Modeling [10.497742214344855]
In particular, the roles of various components in Transformers, such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear.
We provide an alternate and simpler proof to show that vanilla Transformers are Turing-complete.
We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary.
arXiv Detail & Related papers (2020-06-16T16:27:56Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.