Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,λ}$ Targets
- URL: http://arxiv.org/abs/2602.20555v1
- Date: Tue, 24 Feb 2026 05:14:01 GMT
- Title: Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,λ}$ Targets
- Authors: Yanming Lai, Defeng Sun
- Abstract summary: This paper is the first work proving that standard Transformers can approximate Hölder functions. By introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures.
- Score: 8.844802588836059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first work proving that standard Transformers can approximate Hölder functions $C^{s,\lambda}\left([0,1]^{d\times n}\right)$ ($s\in\mathbb{N}_{\geq 0}$, $0<\lambda\leq 1$) under the $L^t$ distance ($t \in [1, \infty]$) with arbitrary precision. Building upon this approximation result, we demonstrate that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. It is worth mentioning that, by introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive the upper bounds for the Lipschitz constant of standard Transformers and their memorization capacity, which may be of independent interest. These findings provide theoretical justification for the powerful capabilities of Transformer models.
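For orientation, the classical benchmark behind the minimax claim can be written in one display; the identification of the smoothness index with $s+\lambda$ and of the ambient input dimension with $D = dn$ is an assumption inferred from the abstract, not a detail taken from the paper itself. A function belongs to $C^{s,\lambda}$ if its partial derivatives up to order $s$ exist and the order-$s$ derivatives satisfy $|D^{\alpha}f(x)-D^{\alpha}f(y)|\leq L\|x-y\|^{\lambda}$ for all $|\alpha|=s$. For nonparametric regression from $N$ samples over such a class, the classical minimax rate of the squared $L^2$ risk is
$$\inf_{\hat f}\;\sup_{f\in C^{s,\lambda}}\;\mathbb{E}\,\big\|\hat f-f\big\|_{L^2}^{2}\;\asymp\;N^{-\frac{2(s+\lambda)}{2(s+\lambda)+D}},$$
so "achieving the minimax rate" means the Transformer estimator's risk decays at this order, possibly up to logarithmic factors.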
Related papers
- Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality [52.424255020469595]
Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens. We show that a shallow measure-theoretic Transformer learns the recall-and-predict map under a spectral assumption on the input densities.
arXiv Detail & Related papers (2026-02-02T09:34:17Z) - Finite-Time Analysis of Gradient Descent for Shallow Transformers [16.566605776410068]
We study why Transformers perform so well, focusing on their optimization landscape. The tradeoff is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length.
arXiv Detail & Related papers (2026-01-23T07:28:17Z) - Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective [7.069772598731282]
The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the Hölder continuous function class $\mathcal{H}_Q^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality.
arXiv Detail & Related papers (2025-04-18T08:56:53Z) - Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent [15.291830857281015]
We investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods. Our analysis demonstrates that even after $\mathrm{poly}(d)$ gradient queries, the generalization error of the Transformer model still remains substantially large.
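To make the learning target above concrete, here is a minimal Python sketch of the $d$-bit majority function; the function name, the 0/1 encoding, and the toy input are illustrative choices, not details from the cited paper.

```python
def majority(bits):
    """Majority of d Boolean inputs: 1 if more than half of the bits are 1, else 0."""
    return 1 if 2 * sum(bits) > len(bits) else 0

# Example: d = 5 inputs, three of which are 1, so the majority is 1.
print(majority([1, 0, 1, 1, 0]))  # -> 1
```

Despite being expressible by a single linear threshold, the cited result argues that gradient-trained Transformers do not reach small generalization error on this family within $\mathrm{poly}(d)$ gradient queries.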
arXiv Detail & Related papers (2025-04-07T03:08:12Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
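For readers unfamiliar with the baseline mentioned above, a minimal sketch of add-$\lambda$ (Lidstone) smoothing for a bigram model; the function name `add_lambda_bigram`, the default $\lambda = 0.5$, and the toy corpus are illustrative assumptions, not details from the cited paper.

```python
from collections import Counter

def add_lambda_bigram(corpus, vocab, lam=0.5):
    """Estimate smoothed bigram probabilities P(w2 | w1) with add-lambda smoothing.

    P(w2 | w1) = (count(w1, w2) + lam) / (count(w1) + lam * |V|)
    """
    bigram_counts = Counter(zip(corpus, corpus[1:]))   # counts of adjacent pairs
    unigram_counts = Counter(corpus[:-1])              # counts of conditioning words
    V = len(vocab)

    def prob(w1, w2):
        return (bigram_counts[(w1, w2)] + lam) / (unigram_counts[w1] + lam * V)

    return prob

# Toy usage: a tiny corpus over a three-word vocabulary.
corpus = ["a", "b", "a", "a", "c", "b"]
prob = add_lambda_bigram(corpus, vocab={"a", "b", "c"}, lam=0.5)
print(prob("a", "b"))  # smoothed estimate of P(b | a)
```

The point of the sketch is only to show how cheap this classical estimator is; the cited comparison measures how well trained transformers match it on random $n$-gram data.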
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Higher-Order Transformer Derivative Estimates for Explicit Pathwise Learning Guarantees [9.305677878388664]
This paper fills a gap in the literature by precisely estimating all higher-order derivatives of the transformer model. We obtain fully-explicit estimates of all constants in terms of the number of attention heads, the depth and width of each transformer block, and the number of normalization layers. We conclude that real-world transformers can learn from $N$ samples of a single Markov process's trajectory at a rate of $O(\operatorname{polylog}(N)/\sqrt{N})$.
arXiv Detail & Related papers (2024-05-26T13:19:32Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)