Related papers: Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

URL: http://arxiv.org/abs/2504.13558v1
Date: Fri, 18 Apr 2025 08:56:53 GMT
Title: Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
Authors: Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan,
Abstract summary: The Transformer model is widely used in various application areas of machine learning, such as natural language processing.<n>This paper investigates the approximation of the H"older continuous function class $mathcalH_Qbetaleft([0,1]dtimes n,mathbbRdtimes nright)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality.
Score: 7.069772598731282
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the H\"older continuous function class $\mathcal{H}_{Q}^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality. These Transformers consist of one self-attention layer with one head and the softmax function as the activation function, along with several feedforward layers. For example, to achieve an approximation accuracy of $\epsilon$, if the activation functions of the feedforward layers in the Transformer are ReLU and floor, only $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ layers of feedforward layers are needed, with widths of these layers not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$. If other activation functions are allowed in the feedforward layers, the width of the feedforward layers can be further reduced to a constant. These results demonstrate that Transformers have a strong expressive capability. The construction in this paper is based on the Kolmogorov-Arnold Representation Theorem and does not require the concept of contextual mapping, hence our proof is more intuitively clear compared to previous Transformer approximation works. Additionally, the translation technique proposed in this paper helps to apply the previous approximation results of feedforward neural networks to Transformer research.

Related papers

Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent [15.291830857281015]
We investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods.<n>Our analysis demonstrates that even after $mathrmpoly(d)$ gradient queries, the generalization error of the Transformer model still remains substantially large.
arXiv Detail & Related papers (2025-04-07T03:08:12Z)
On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks. We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z)
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
Higher-Order Transformer Derivative Estimates for Explicit Pathwise Learning Guarantees [9.305677878388664]
This paper fills a gap in the literature by precisely estimating all the higher-order derivatives of all orders for the transformer model. We obtain fully-explicit estimates of all constants in terms of the number of attention heads, the depth and width of each transformer block, and the number of normalization layers. We conclude that real-world transformers can learn from $N$ samples of a single Markov process's trajectory at a rate of $O(operatornamepolylog(N/sqrtN)$.
arXiv Detail & Related papers (2024-05-26T13:19:32Z)
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task. We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z)
Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. One-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
arXiv Detail & Related papers (2023-07-26T08:07:37Z)
Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n2$ connections. We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.