On the Convergence of Encoder-only Shallow Transformers
- URL: http://arxiv.org/abs/2311.01575v1
- Date: Thu, 2 Nov 2023 20:03:05 GMT
- Title: On the Convergence of Encoder-only Shallow Transformers
- Authors: Yongtao Wu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher
- Abstract summary: We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
- Score: 62.639819460956176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling in the finite-width regime. The difficulty lies in how to tackle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully handle the input/output of the softmax, and prove that quadratic overparameterization is sufficient for the global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is provided, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations. We believe our results can pave the way for a better understanding of modern Transformers, particularly of their training dynamics.
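To make the setting concrete, below is a minimal numerical sketch (not the paper's exact parameterization) of an encoder-only shallow Transformer: one single-head softmax self-attention layer followed by a two-layer ReLU head, with LeCun-style 1/sqrt(fan_in) initialization and an explicit 1/sqrt(width) output scaling. The dimensions, pooling, and the particular scaling exponent are illustrative assumptions; the convergence analysis itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def lecun_init(fan_in, fan_out):
    # LeCun initialization: zero-mean Gaussian with variance 1/fan_in.
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shallow_encoder(X, d_model=64, width=256):
    """X: (seq_len, d_in), a single input sequence."""
    d_in = X.shape[1]
    # Single-head softmax self-attention parameters (illustrative shapes).
    W_q, W_k, W_v = (lecun_init(d_in, d_model) for _ in range(3))
    # Two-layer ReLU head of width `width`.
    W_1 = lecun_init(d_model, width)
    w_2 = lecun_init(width, 1)

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_model))     # softmax over keys
    pooled = (A @ V).mean(axis=0)               # pool tokens to one vector
    # Output scaled by 1/sqrt(width); the choice of this exponent is the kind
    # of scaling scheme whose effect on convergence the paper analyzes.
    return (np.maximum(pooled @ W_1, 0.0) @ w_2 / np.sqrt(width)).item()

print(shallow_encoder(rng.normal(size=(10, 32))))
```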
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical methods for training and inference, such as low-rank computation, achieve impressive performance when learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
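The sketch below illustrates the low-rank computation idea in its simplest form: a frozen weight matrix is adapted through a rank-r correction W + B A rather than a full update. The LoRA-style framing and all names and shapes here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 512, 8                              # full dimension vs. low rank
W = rng.normal(size=(d, d)) / np.sqrt(d)   # frozen "pretrained" weight
A = rng.normal(0.0, 0.01, size=(r, d))     # trainable low-rank factor
B = np.zeros((d, r))                       # zero init: correction starts at 0

x = rng.normal(size=(d,))
y = (W + B @ A) @ x                        # adapted forward pass

print("trainable params:", 2 * d * r, "vs full update:", d * d)
```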
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima.
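For intuition about this setting, the following sketch samples sequences from a first-order Markov chain and evaluates the next-token prediction loss; the transformer is replaced by a plain bigram table, so this only illustrates the data model and the loss, not the paper's training-dynamics results.

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],                  # transition matrix of the chain
              [0.2, 0.8]])

def sample_chain(T=256):
    s = [rng.integers(2)]
    for _ in range(T - 1):
        s.append(rng.choice(2, p=P[s[-1]]))
    return np.array(s)

def next_token_loss(seq, Q):
    # Average cross-entropy when the next token is predicted from Q[current].
    return -np.mean(np.log(Q[seq[:-1], seq[1:]] + 1e-12))

seq = sample_chain()
print(next_token_loss(seq, np.full((2, 2), 0.5)))   # uniform predictor
print(next_token_loss(seq, P))                      # predicting with the true chain
```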
arXiv Detail & Related papers (2024-06-05T08:57:41Z) - Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
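A generic, admittedly loose illustration of the associative-memory view: softmax attention over a set of stored patterns retrieves the pattern closest to a query, and retrieval sharpens as the inverse temperature grows. This sketch is not the paper's framework; every quantity in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(16, 32))              # 16 stored patterns ("memories")
q = M[5] + 0.1 * rng.normal(size=32)       # noisy query near pattern 5

def retrieve(q, M, beta):
    s = beta * (M @ q)                     # similarity of the query to each memory
    w = np.exp(s - s.max())                # numerically stable softmax weights
    w /= w.sum()
    return w @ M                           # softmax-weighted recall

for beta in (0.1, 1.0, 10.0):
    r = retrieve(q, M, beta)
    print(beta, int(np.argmax(M @ r)))     # index of the memory closest to the recall
```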
arXiv Detail & Related papers (2024-05-14T15:48:36Z) - White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? [28.507148793856388]
We present a family of white-box, transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable.
Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets.
arXiv Detail & Related papers (2023-11-22T02:23:32Z) - Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
We also prove that one-layer, single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation-equivariant functions on a compact domain.
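The sketch below only constructs the object these results concern: a single self-attention layer whose query/key/value weights are explicitly low rank (each weight factored through a small inner dimension). The approximation and memorization arguments themselves are in the paper and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
d, rank, T = 32, 2, 10

def low_rank(d_out, d_in, r):
    # Explicitly rank-r weight matrix, factored through an inner dimension r.
    return rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in))

W_q, W_k, W_v = (low_rank(d, d, rank) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(T, d))                                # one input sequence
A = softmax((X @ W_q.T) @ (X @ W_k.T).T / np.sqrt(d))      # attention over the full context
out = A @ (X @ W_v.T)
print(out.shape, np.linalg.matrix_rank(W_q))               # (10, 32), rank 2
```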
arXiv Detail & Related papers (2023-07-26T08:07:37Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers, specifically whether the model is capable of approximating any continuous sequence-to-sequence function.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
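For reference, a plain additive relative-positional-encoding (RPE) attention layer, the model class the negative result concerns, looks roughly as follows; the URPE module proposed in the paper adds further structure that is not shown, and all shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 8, 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
b = rng.normal(size=2 * T - 1)                       # one bias per relative offset

rel = np.arange(T)[:, None] - np.arange(T)[None, :]  # i - j in [-(T-1), T-1]
B = b[rel + T - 1]                                   # (T, T) relative position bias

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d) + B    # content term + RPE bias
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
print((A @ (X @ W_v)).shape)                         # (8, 16)
```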
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
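As background for the setting, the toy sketch below shows the compressive sensing measurement model y = Phi x with m << n random measurements, using a naive least-norm recovery as a placeholder; CSformer's actual adaptive sampling and CNN + Transformer recovery stages are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 256, 64                                  # signal length vs. measurements
Phi = rng.normal(size=(m, n)) / np.sqrt(m)      # random sampling matrix
x = np.zeros(n)
x[rng.choice(n, 8, replace=False)] = 1.0        # sparse test signal
y = Phi @ x                                     # compressed measurements

x_hat = np.linalg.pinv(Phi) @ y                 # naive least-norm recovery (placeholder)
print(float(np.linalg.norm(x - x_hat)))         # error of the naive baseline
```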
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
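A minimal sketch of the two placements being compared (assumed standard definitions, not code from the paper): Post-LN normalizes after the residual addition, while Pre-LN normalizes the input of each sub-layer so the residual path stays unnormalized.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def post_ln_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sub-layer input; the residual path stays unnormalized.
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(6).normal(size=(4, 16))
f = lambda h: 0.5 * h                            # stand-in for attention / FFN
print(post_ln_block(x, f).shape, pre_ln_block(x, f).shape)
```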
arXiv Detail & Related papers (2020-02-12T00:33:03Z)