Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input
- URL: http://arxiv.org/abs/2305.18699v1
- Date: Tue, 30 May 2023 02:44:49 GMT
- Title: Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input
- Authors: Shokichi Takakura, Taiji Suzuki
- Abstract summary: We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
- Score: 50.83356836818667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the great success of Transformer networks in various applications
such as natural language processing and computer vision, their theoretical
aspects are not well understood. In this paper, we study the approximation and
estimation ability of Transformers as sequence-to-sequence functions with
infinite dimensional inputs. Although inputs and outputs are both infinite
dimensional, we show that when the target function has anisotropic smoothness,
Transformers can avoid the curse of dimensionality due to their feature
extraction ability and parameter sharing property. In addition, we show that
even if the smoothness changes depending on each input, Transformers can
estimate the importance of features for each input and extract important
features dynamically. We then prove that Transformers achieve a convergence
rate similar to that of the fixed-smoothness case. Our theoretical
results support the practical success of Transformers for high dimensional
data.
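To make the flavor of these results concrete, the following LaTeX sketch states a deliberately simplified, Hölder-type version of anisotropic smoothness and the kind of dimension-free rate such an assumption can yield. The paper's actual function classes for infinite dimensional sequence-to-sequence maps, and its precise rates, differ in the details; treat this as an illustrative assumption, not the paper's statement.

```latex
% Illustrative only -- not the paper's exact definitions or theorem statements.
% Anisotropic (Hoelder-type) smoothness: each coordinate j has its own exponent a_j > 0.
\[
  |f(x) - f(y)| \;\le\; \sum_{j \ge 1} L_j \, |x_j - y_j|^{\min(a_j, 1)} .
\]
% An effective smoothness \tilde{a} aggregates the coordinate-wise exponents:
\[
  \frac{1}{\tilde{a}} \;=\; \sum_{j \ge 1} \frac{1}{a_j} .
\]
% Classical nonparametric theory then suggests an estimation error of order
\[
  n^{-\frac{2\tilde{a}}{2\tilde{a}+1}} \qquad \text{(up to logarithmic factors)},
\]
% which does not blow up with the nominal input dimension as long as the a_j grow
% quickly enough -- the sense in which the curse of dimensionality can be avoided.
```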
Related papers
- Measure-to-measure interpolation using Transformers [6.13239149235581]
Transformers are deep neural network architectures that underpin the recent successes of large language models.
A Transformer acts as a measure-to-measure map, implemented as a specific interacting particle system on the unit sphere (a schematic sketch of this particle-system view follows this entry).
We provide an explicit choice of parameters that allows a single Transformer to match $N$ arbitrary input measures to $N$ arbitrary target measures.
arXiv Detail & Related papers (2024-11-07T09:18:39Z)
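As a rough, self-contained illustration of the interacting-particle view of attention (not the explicit parameter choice constructed in the paper), the sketch below keeps a set of token-particles on the unit sphere and applies repeated attention-driven updates followed by re-projection. The temperature `beta`, the step size, and the identity value map are assumptions made purely for the demo.

```python
import numpy as np

def sphere_project(x):
    """Project each row of x onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_particle_step(X, beta=1.0, step=0.1):
    """One attention-driven interacting-particle update (schematic, not the paper's construction).

    X: (n, d) particles (tokens) on the unit sphere. Each particle moves toward the
    attention-weighted average of all particles, then is re-projected onto the sphere.
    """
    logits = beta * (X @ X.T)                        # pairwise inner products as attention logits
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    drift = weights @ X                              # attention-weighted averages
    return sphere_project(X + step * drift)

rng = np.random.default_rng(0)
X = sphere_project(rng.normal(size=(8, 3)))          # 8 particles on the 2-sphere
for _ in range(50):
    X = attention_particle_step(X)
print(np.round(X @ X.T, 2))                          # pairwise inner products after the dynamics
```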
- Transformers are Expressive, But Are They Expressive Enough for Regression? [38.369337945109855]
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z)
- On the Expressive Power of a Variant of the Looped Transformer [83.30272757948829]
We design a novel transformer block, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
The proposed AlgoFormer can achieve significantly higher expressiveness in algorithm representation when using the same number of parameters.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
- Transformers, parallel computation, and logarithmic depth [33.659870765923884]
We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation.
arXiv Detail & Related papers (2024-02-14T15:54:55Z)
- Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
We also show that one-layer, single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation-equivariant functions on a compact domain (a small numerical check of permutation equivariance follows this entry).
arXiv Detail & Related papers (2023-07-26T08:07:37Z)
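The permutation-equivariance part of this claim is easy to check numerically for a plain single-head self-attention layer without positional encodings: permuting the input rows permutes the output rows identically. The random weights below are placeholders, not the low-rank matrices from the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # model / head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    """Single-head self-attention without positional encodings: (n, d) -> (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

X = rng.normal(size=(6, d))                  # a length-6 input sequence
perm = rng.permutation(6)

# Permutation equivariance: attention(P X) == P attention(X)
lhs = self_attention(X[perm])
rhs = self_attention(X)[perm]
print(np.allclose(lhs, rhs))                 # True
```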
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions for universal approximation (a schematic sketch follows this entry).
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
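The following is a rough sketch of the multiplicative relative-position idea behind URPE-style attention: the attention probabilities are reweighted element-wise by a learnable Toeplitz matrix indexed by relative offset. The exact construction, normalization, and parameterization in the paper may differ, so treat this as an assumed schematic rather than the authors' module.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4                                  # sequence length, head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Learnable relative-position coefficients c[i - j], one per offset in [-(n-1), n-1].
rel_coeff = rng.normal(size=2 * n - 1)

def toeplitz_from_offsets(c, n):
    """Build C with C[i, j] = c[i - j + n - 1] (a Toeplitz matrix over relative positions)."""
    idx = np.arange(n)
    return c[idx[:, None] - idx[None, :] + n - 1]

def urpe_like_attention(X):
    """Attention probabilities multiplied element-wise by a relative-position matrix.

    Schematic only: the actual URPE module may normalize or parameterize differently.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    C = toeplitz_from_offsets(rel_coeff, n)
    return (probs * C) @ V                   # element-wise reweighting, then value mixing

X = rng.normal(size=(n, d))
print(urpe_like_attention(X).shape)          # (6, 4)
```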
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer (a generic group-wise projection sketch follows this entry).
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
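Group-wise transformation, in the generic sense, splits the feature dimension into groups and applies a smaller map to each group, cutting the parameter count roughly by the number of groups. The sketch below shows that generic idea for a linear projection only; it is not a reproduction of LW-Transformer's specific architecture.

```python
import numpy as np

def groupwise_linear(X, weights):
    """Apply a separate small linear map to each feature group and concatenate.

    X: (n, d) inputs with d divisible by the number of groups.
    weights: list of (d/G, d/G) matrices, one per group.
    Parameter count: G * (d/G)^2 = d^2 / G, versus d^2 for a full linear layer.
    """
    groups = np.split(X, len(weights), axis=-1)
    return np.concatenate([g @ W for g, W in zip(groups, weights)], axis=-1)

rng = np.random.default_rng(2)
n, d, G = 5, 8, 4                            # sequence length, width, number of groups
weights = [rng.normal(size=(d // G, d // G)) for _ in range(G)]
X = rng.normal(size=(n, d))

out = groupwise_linear(X, weights)
full_params = d * d
group_params = sum(W.size for W in weights)
print(out.shape, full_params, group_params)  # (5, 8) 64 16
```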
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose a novel Scalable Transformer, which naturally contains sub-Transformers of different scales with shared parameters (a weight-sharing sketch follows this entry).
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
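One simple way to obtain sub-networks of different widths that share parameters is to let each smaller model use a slice of one full-size weight matrix, so training the full model also trains every sub-model. This is a schematic assumption for illustration, not the paper's actual Scalable Transformer scheme or its three-stage training procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
d_full = 8                                       # width of the full model
W_shared = rng.normal(size=(d_full, d_full))     # one parameter matrix shared by all scales

def sub_projection(X, width):
    """Project using the top-left (width x width) slice of the shared weight matrix.

    Every sub-model of a smaller width reuses the same parameters that the
    full model trains; nothing is duplicated.
    """
    return X[:, :width] @ W_shared[:width, :width]

X = rng.normal(size=(5, d_full))
for width in (2, 4, 8):                          # three "sub-Transformer" widths
    print(width, sub_projection(X, width).shape) # prints each width and its output shape
```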
- On the Computational Power of Transformers and its Implications in Sequence Modeling [10.497742214344855]
In particular, the roles of various components in Transformers, such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear.
We provide an alternate and simpler proof to show that vanilla Transformers are Turing-complete.
We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary.
arXiv Detail & Related papers (2020-06-16T16:27:56Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.