On the Convergence of Encoder-only Shallow Transformers
- URL: http://arxiv.org/abs/2311.01575v1
- Date: Thu, 2 Nov 2023 20:03:05 GMT
- Title: On the Convergence of Encoder-only Shallow Transformers
- Authors: Yongtao Wu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher
- Abstract summary: We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
- Score: 62.639819460956176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling in the finite-width regime. The difficulty lies in how to tackle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully handle the input/output of the softmax, and prove that quadratic overparameterization is sufficient for the global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is provided, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations. We believe our results can pave the way for a better understanding of modern Transformers, particularly of their training dynamics.
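To make the setting concrete, below is a minimal numerical sketch (not the paper's exact parameterization) of an encoder-only shallow Transformer: one single-head softmax self-attention layer followed by a two-layer ReLU head, with LeCun-style 1/sqrt(fan_in) initialization and an explicit 1/sqrt(width) output scaling. The dimensions, pooling, and the particular scaling exponent are illustrative assumptions; the convergence analysis itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def lecun_init(fan_in, fan_out):
    # LeCun initialization: zero-mean Gaussian with variance 1/fan_in.
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shallow_encoder(X, d_model=64, width=256):
    """X: (seq_len, d_in), a single input sequence."""
    d_in = X.shape[1]
    # Single-head softmax self-attention parameters (illustrative shapes).
    W_q, W_k, W_v = (lecun_init(d_in, d_model) for _ in range(3))
    # Two-layer ReLU head of width `width`.
    W_1 = lecun_init(d_model, width)
    w_2 = lecun_init(width, 1)

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_model))     # softmax over keys
    pooled = (A @ V).mean(axis=0)               # pool tokens to one vector
    # Output scaled by 1/sqrt(width); the choice of this exponent is the kind
    # of scaling scheme whose effect on convergence the paper analyzes.
    return (np.maximum(pooled @ W_1, 0.0) @ w_2 / np.sqrt(width)).item()

print(shallow_encoder(rng.normal(size=(10, 32))))
```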
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical methods for training and inference, such as low-rank computation, achieve impressive performance when learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
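The sketch below illustrates the low-rank computation idea in its simplest form: a frozen weight matrix is adapted through a rank-r correction W + B A rather than a full update. The LoRA-style framing and all names and shapes here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 512, 8                              # full dimension vs. low rank
W = rng.normal(size=(d, d)) / np.sqrt(d)   # frozen "pretrained" weight
A = rng.normal(0.0, 0.01, size=(r, d))     # trainable low-rank factor
B = np.zeros((d, r))                       # zero init: correction starts at 0

x = rng.normal(size=(d,))
y = (W + B @ A) @ x                        # adapted forward pass

print("trainable params:", 2 * d * r, "vs full update:", d * d)
```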
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima.
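For intuition about this setting, the following sketch samples sequences from a first-order Markov chain and evaluates the next-token prediction loss; the transformer is replaced by a plain bigram table, so this only illustrates the data model and the loss, not the paper's training-dynamics results.

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],                  # transition matrix of the chain
              [0.2, 0.8]])

def sample_chain(T=256):
    s = [rng.integers(2)]
    for _ in range(T - 1):
        s.append(rng.choice(2, p=P[s[-1]]))
    return np.array(s)

def next_token_loss(seq, Q):
    # Average cross-entropy when the next token is predicted from Q[current].
    return -np.mean(np.log(Q[seq[:-1], seq[1:]] + 1e-12))

seq = sample_chain()
print(next_token_loss(seq, np.full((2, 2), 0.5)))   # uniform predictor
print(next_token_loss(seq, P))                      # predicting with the true chain
```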
arXiv Detail & Related papers (2024-06-05T08:57:41Z) - Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
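A generic, admittedly loose illustration of the associative-memory view: softmax attention over a set of stored patterns retrieves the pattern closest to a query, and retrieval sharpens as the inverse temperature grows. This sketch is not the paper's framework; every quantity in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(16, 32))              # 16 stored patterns ("memories")
q = M[5] + 0.1 * rng.normal(size=32)       # noisy query near pattern 5

def retrieve(q, M, beta):
    s = beta * (M @ q)                     # similarity of the query to each memory
    w = np.exp(s - s.max())                # numerically stable softmax weights
    w /= w.sum()
    return w @ M                           # softmax-weighted recall

for beta in (0.1, 1.0, 10.0):
    r = retrieve(q, M, beta)
    print(beta, int(np.argmax(M @ r)))     # index of the memory closest to the recall
```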
arXiv Detail & Related papers (2024-05-14T15:48:36Z) - White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? [28.507148793856388]
We present a family of white-box, transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable.
Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets.
arXiv Detail & Related papers (2023-11-22T02:23:32Z) - Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? [37.820617032391404]
We show that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.
We also prove that one-layer, single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation-equivariant functions on a compact domain.
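The sketch below only constructs the object these results concern: a single self-attention layer whose query/key/value weights are explicitly low rank (each weight factored through a small inner dimension). The approximation and memorization arguments themselves are in the paper and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
d, rank, T = 32, 2, 10

def low_rank(d_out, d_in, r):
    # Explicitly rank-r weight matrix, factored through an inner dimension r.
    return rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in))

W_q, W_k, W_v = (low_rank(d, d, rank) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(T, d))                                # one input sequence
A = softmax((X @ W_q.T) @ (X @ W_k.T).T / np.sqrt(d))      # attention over the full context
out = A @ (X @ W_v.T)
print(out.shape, np.linalg.matrix_rank(W_q))               # (10, 32), rank 2
```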
arXiv Detail & Related papers (2023-07-26T08:07:37Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers, specifically whether the model is capable of approximating any continuous sequence-to-sequence function.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
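For reference, a plain additive relative-positional-encoding (RPE) attention layer, the model class the negative result concerns, looks roughly as follows; the URPE module proposed in the paper adds further structure that is not shown, and all shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 8, 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
b = rng.normal(size=2 * T - 1)                       # one bias per relative offset

rel = np.arange(T)[:, None] - np.arange(T)[None, :]  # i - j in [-(T-1), T-1]
B = b[rel + T - 1]                                   # (T, T) relative position bias

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d) + B    # content term + RPE bias
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
print((A @ (X @ W_v)).shape)                         # (8, 16)
```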
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
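As background for the setting, the toy sketch below shows the compressive sensing measurement model y = Phi x with m << n random measurements, using a naive least-norm recovery as a placeholder; CSformer's actual adaptive sampling and CNN + Transformer recovery stages are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 256, 64                                  # signal length vs. measurements
Phi = rng.normal(size=(m, n)) / np.sqrt(m)      # random sampling matrix
x = np.zeros(n)
x[rng.choice(n, 8, replace=False)] = 1.0        # sparse test signal
y = Phi @ x                                     # compressed measurements

x_hat = np.linalg.pinv(Phi) @ y                 # naive least-norm recovery (placeholder)
print(float(np.linalg.norm(x - x_hat)))         # error of the naive baseline
```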
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
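A minimal sketch of the two placements being compared (assumed standard definitions, not code from the paper): Post-LN normalizes after the residual addition, while Pre-LN normalizes the input of each sub-layer so the residual path stays unnormalized.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def post_ln_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sub-layer input; the residual path stays unnormalized.
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(6).normal(size=(4, 16))
f = lambda h: 0.5 * h                            # stand-in for attention / FFN
print(post_ln_block(x, f).shape, pre_ln_block(x, f).shape)
```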
arXiv Detail & Related papers (2020-02-12T00:33:03Z)