Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights
- URL: http://arxiv.org/abs/2505.03205v2
- Date: Fri, 13 Jun 2025 04:05:13 GMT
- Title: Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights
- Authors: Zhaiming Shen, Alex Havrilla, Rongjie Lai, Alexander Cloninger, Wenjing Liao
- Abstract summary: This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold. We prove approximation and generalization errors that crucially depend on the intrinsic dimension of the manifold. Our results demonstrate that transformers can leverage low-complexity structures in the learning task even when the input data are perturbed by high-dimensional noise.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit low-dimensional structures, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data/tasks, though theoretical understanding of this remains largely unexplored for transformers. This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold. Specifically, the input data lie in a tubular neighborhood of a manifold, while the ground truth function depends on the projection of the noisy data onto the manifold. We prove approximation and generalization errors which crucially depend on the intrinsic dimension of the manifold. Our results demonstrate that transformers can leverage low-complexity structures in the learning task even when the input data are perturbed by high-dimensional noise. Our novel proof technique constructs representations of basic arithmetic operations by transformers, which may hold independent interest.
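To make the setup above concrete, the following is a minimal sketch (NumPy, with illustrative names and constants that are not taken from the paper) of the data model the abstract describes: clean points on a low-dimensional manifold embedded in a high-dimensional ambient space, observations perturbed within a tubular neighborhood of the manifold, and a regression target that depends only on the projection onto the manifold, never on the noise.

```python
# Hypothetical sketch of the data model in the abstract: clean points q live on a
# low-dimensional manifold (here a unit circle, intrinsic dimension d = 1) embedded
# in R^D, the observed inputs x = q + noise stay inside a tube of radius tau around
# the manifold, and the target y depends only on the intrinsic coordinate of q.
import numpy as np

rng = np.random.default_rng(0)
D, n, tau = 64, 2000, 0.05          # ambient dimension, sample size, tube radius

# Random orthonormal frame embedding the circle's 2-D plane into R^D.
frame, _ = np.linalg.qr(rng.standard_normal((D, 2)))

theta = rng.uniform(0.0, 2.0 * np.pi, size=n)                    # intrinsic coordinate
q = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ frame.T   # points on the manifold

# Ambient noise rescaled to lie inside the tube of radius tau, so inputs are
# near, but not on, the manifold.
noise = rng.standard_normal((n, D))
noise *= (tau * rng.uniform(0.0, 1.0, size=(n, 1))) / np.linalg.norm(noise, axis=1, keepdims=True)
x = q + noise                                                    # observed noisy inputs

# Ground-truth regression function of the projection only (here, of theta).
y = np.sin(3.0 * theta) + 0.5 * np.cos(theta)

print(x.shape, y.shape)   # (2000, 64) (2000,)
```

In this toy setting a learner sees only the noisy pairs (x, y); the paper's results concern how well a transformer trained on such pairs can do, with error rates governed by the intrinsic dimension (here 1) rather than the ambient dimension D.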
Related papers
- Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures [10.970776446566909]
This paper investigates the capabilities of transformers in solving unsupervised learning problems. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks. We prove that transformers can approximate both the EM algorithm and a core component of spectral methods.
arXiv Detail & Related papers (2025-05-17T09:02:18Z)
- A Theory for Compressibility of Graph Transformers for Transductive Learning [6.298115235439078]
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks.
All train/test/validation samples are present during training, making them more akin to a semi-supervised task.
We establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed.
arXiv Detail & Related papers (2024-11-20T04:20:17Z)
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context. We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise. It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- When can transformers reason with abstract symbols? [25.63285482210457]
We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set.
This is in contrast to classical fully-connected networks, which we prove fail to learn to reason.
arXiv Detail & Related papers (2023-10-15T06:45:38Z)
- FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised Pretraining [36.44039681893334]
Hyperspectral images (HSIs) contain rich spectral and spatial information.
Current state-of-the-art hyperspectral transformers only tokenize the input HSI sample along the spectral dimension.
We propose a novel factorized spectral-spatial transformer that incorporates factorized self-supervised pretraining procedures.
arXiv Detail & Related papers (2023-09-18T02:05:52Z)
- Representational Strengths and Limitations of Transformers [33.659870765923884]
We establish both positive and negative results on the representation power of attention layers.
We show the necessity and role of a large embedding dimension in a transformer.
We also present natural variants that can be efficiently solved by attention layers.
arXiv Detail & Related papers (2023-06-05T14:05:04Z)
- Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$ (see the sketch after this list).
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
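For the saturated-transformers entry above, here is a minimal numerical sketch of the distinction it relies on, under the standard definitions: hard attention puts all weight on a single highest-scoring position, while saturated attention is the zero-temperature limit of softmax and spreads weight uniformly over every position that ties for the maximum. The code is illustrative only and is not taken from the cited paper.

```python
# Illustration (not from the cited paper) of hard vs. saturated attention.
# Saturated attention is the limit of softmax(c * scores) as c grows: the weight
# spreads uniformly over all positions tied for the maximum score, whereas hard
# attention commits to a single argmax position.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 5.0, 5.0, 1.0, 5.0])    # three positions tie for the max

hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0                    # hard attention: one winner only

for c in (1.0, 10.0, 100.0):
    print(c, np.round(softmax(c * scores), 3))   # approaches [0, 1/3, 1/3, 0, 1/3]

print("hard attention:", hard)                   # [0, 1, 0, 0, 0]
```

Averaging uniformly over all maximizers lets a saturated head aggregate information from a whole set of positions at once, which a single hard-attention head, locked onto one position, cannot do.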