Finite-Time Analysis of Gradient Descent for Shallow Transformers
- URL: http://arxiv.org/abs/2601.16514v1
- Date: Fri, 23 Jan 2026 07:28:17 GMT
- Title: Finite-Time Analysis of Gradient Descent for Shallow Transformers
- Authors: Enes Arda, Semih Cayci, Atilla Eryilmaz
- Abstract summary: We study why Transformers perform so well despite their non-convex optimization landscape. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length.
- Score: 16.566605776410068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and confirm the predicted scaling laws for Transformers.
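The setup described in the abstract can be illustrated with a small simulation. The sketch below is only an illustration, not the paper's exact construction: it builds a one-layer attention model with $m$ independent heads, generates labels from a randomly initialized teacher of the same form, and trains the student's attention weights by projected gradient descent, projecting each head back into a ball around its initialization as is common in kernel-regime analyses. The readout, projection radius, step size, and data distribution are all assumptions made for the sake of the example.

```python
# Teacher-student sketch (illustrative assumptions only): a shallow attention
# model with m independent heads, trained by projected gradient descent.
import torch

torch.manual_seed(0)
T, d, m, n = 16, 8, 32, 512   # sequence length, token dim, number of heads, samples

def forward(X, W, a):
    # X: (n, T, d) token sequences; W: (m, d, d) per-head weights; a: (m,) fixed readout.
    # Each head attends from the last token to the whole sequence and emits a scalar.
    q = X[:, -1, :]                                        # (n, d) query = last token
    scores = torch.einsum('ntd,mde,ne->nmt', X, W, q) / d ** 0.5
    attn = torch.softmax(scores, dim=-1)                   # (n, m, T)
    ctx = torch.einsum('nmt,ntd->nmd', attn, X)            # (n, m, d) attention-weighted context
    head_out = ctx.mean(dim=-1)                            # (n, m) scalar per head
    return head_out @ a / m ** 0.5                         # (n,) combined prediction

# Teacher network with its own random parameters generates the labels.
W_star = torch.randn(m, d, d)
a_star = torch.sign(torch.randn(m))
X = torch.randn(n, T, d)
y = forward(X, W_star, a_star)

# Student: train the attention weights only, with projected gradient descent.
W0 = torch.randn(m, d, d)
W = W0.clone().requires_grad_(True)
a = torch.sign(torch.randn(m))
lr, radius = 0.2, 1.0

for step in range(500):
    loss = 0.5 * ((forward(X, W, a) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad                                   # gradient step
        delta = W - W0
        norms = delta.flatten(1).norm(dim=1).clamp(min=1e-12)
        scale = (radius / norms).clamp(max=1.0).view(m, 1, 1)
        W.copy_(W0 + delta * scale)                        # project each head back into a ball around init
        W.grad.zero_()
    if step % 100 == 0:
        print(f"step {step:4d}  loss {loss.item():.4e}")
```

Varying $T$ and $m$ in such a toy setup is one way to probe, qualitatively, the claimed insensitivity of the optimization error to sequence length and the mild (logarithmic) dependence of the required width on the sample size.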
Related papers
- Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,λ}$ Targets [8.844802588836059]
This paper is the first work proving that standard Transformers can approximate Hölder functions. By introducing two metrics, the size and the dimension vector, we provide a fine-grained characterization of Transformer structures.
arXiv Detail & Related papers (2026-02-24T05:14:01Z) - Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers [5.687100661457289]
We prove that a pretrained transformer with $(\log n)$ parameters and $\bigl(n^{2/(2+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence.
arXiv Detail & Related papers (2026-01-21T14:13:38Z) - Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent [15.291830857281015]
We investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods. Our analysis demonstrates that even after $\mathrm{poly}(d)$ gradient queries, the generalization error of the Transformer model still remains substantially large.
arXiv Detail & Related papers (2025-04-07T03:08:12Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate the effectiveness of our graph function.
arXiv Detail & Related papers (2022-05-27T14:36:55Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)