On the Convergence of Encoder-only Shallow Transformers
- URL: http://arxiv.org/abs/2311.01575v1
- Date: Thu, 2 Nov 2023 20:03:05 GMT
- Title: On the Convergence of Encoder-only Shallow Transformers
- Authors: Yongtao Wu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher
- Abstract summary: We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
- Score: 62.639819460956176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we aim to build the global convergence theory of encoder-only
shallow Transformers under a realistic setting from the perspective of
architectures, initialization, and scaling under a finite width regime. The
difficulty lies in how to tackle the softmax in the self-attention mechanism,
the core ingredient of the Transformer. In particular, we diagnose the scaling scheme,
carefully tackle the input/output of softmax, and prove that quadratic
overparameterization is sufficient for global convergence of our shallow
Transformers under the commonly used He/LeCun initialization in practice. In
addition, we provide a neural tangent kernel (NTK) based analysis, which
facilitates a comprehensive comparison. Our theory demonstrates the separation
in the importance of different scaling schemes and initializations. We believe our
results can pave the way for a better understanding of modern Transformers,
particularly on training dynamics.
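To make the analyzed setting concrete, below is a minimal PyTorch sketch of an encoder-only shallow (single-layer) Transformer with softmax self-attention, He-style initialization, and an explicit width-dependent output scaling, trained by gradient descent on a toy regression task. The scaling exponent, pooling, and single-head structure are illustrative assumptions for exposition, not the paper's exact parameterization.

```python
# A minimal sketch (illustrative assumptions, not the paper's exact
# parameterization): encoder-only, single-layer Transformer with softmax
# self-attention, He (Kaiming) initialization, and a width-dependent scaling.
import torch
import torch.nn as nn


class ShallowEncoder(nn.Module):
    def __init__(self, d_in: int, width: int, scale_exponent: float = 1.0):
        super().__init__()
        # Query/key/value and output maps; `width` is the finite hidden
        # dimension whose overparameterization the theory quantifies.
        self.wq = nn.Linear(d_in, width, bias=False)
        self.wk = nn.Linear(d_in, width, bias=False)
        self.wv = nn.Linear(d_in, width, bias=False)
        self.wo = nn.Linear(width, 1, bias=False)
        self.scale = width ** (-scale_exponent)  # assumed scaling scheme
        for layer in (self.wq, self.wk, self.wv, self.wo):
            nn.init.kaiming_normal_(layer.weight)  # He initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in) -> one scalar prediction per sequence.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.scale * self.wo(attn @ v).mean(dim=1).squeeze(-1)


# Toy regression run: with sufficient width, gradient descent is expected to
# drive the training loss towards zero.
torch.manual_seed(0)
x, y = torch.randn(32, 8, 16), torch.randn(32)
model = ShallowEncoder(d_in=16, width=256)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(200):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```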
Related papers
- What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning.
At its core, the attention block differs in form and functionality from most other architectural components in deep learning.
The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show that gradient flow converges fast on the regression loss despite the non-convexity of the landscape.
This is the first theoretical analysis of multi-layer Transformers in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent [51.50999191584981]
Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam.
We study how SignGD optimizes a two-layer transformer on a noisy dataset.
We find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam require high-quality data for real-world tasks.
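As an illustration of the optimizer discussed above, here is a minimal sketch of a SignGD update in PyTorch, which moves each parameter by the sign of its gradient; the two-layer MLP and hyperparameters are illustrative stand-ins, not the paper's two-layer transformer setup.

```python
# A minimal sketch of a SignGD update (each parameter moves by the sign of
# its gradient); the toy model and hyperparameters below are illustrative
# stand-ins, not the paper's two-layer transformer setup.
import torch


def signgd_step(params, lr: float = 1e-3):
    """In-place SignGD update: w <- w - lr * sign(grad(w))."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.sub_(lr * p.grad.sign())


model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
x, y = torch.randn(64, 10), torch.randn(64, 1)
for _ in range(100):
    loss = ((model(x) - y) ** 2).mean()
    model.zero_grad()
    loss.backward()
    signgd_step(model.parameters())
```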
arXiv Detail & Related papers (2024-10-07T09:36:43Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained on next-token prediction loss can converge to either global or local minima.
arXiv Detail & Related papers (2024-06-05T08:57:41Z) - Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z) - Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
We also show that one-layer, single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
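For intuition, the following is a minimal sketch of the object studied above: a one-layer, single-head softmax self-attention map whose query/key/value weights are explicit rank-r factorizations. The rank and dimensions are illustrative assumptions.

```python
# A minimal sketch of a one-layer, single-head softmax self-attention map
# whose query/key/value weights are explicitly rank-r; rank and dimensions
# are illustrative assumptions.
import torch


def low_rank(d: int, r: int) -> torch.Tensor:
    # Rank-r weight matrix W = A @ B with A: (d, r) and B: (r, d).
    return torch.randn(d, r) @ torch.randn(r, d) / d


def one_layer_attention(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    # x: (seq_len, d); returns context-mixed tokens of the same shape.
    d = x.shape[-1]
    wq, wk, wv = low_rank(d, r), low_rank(d, r), low_rank(d, r)
    scores = (x @ wq) @ (x @ wk).T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ (x @ wv)


out = one_layer_attention(torch.randn(6, 8))  # shape (6, 8)
```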
arXiv Detail & Related papers (2023-07-26T08:07:37Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence function.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by the Transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)