Your Transformer is Secretly Linear
- URL: http://arxiv.org/abs/2405.12250v1
- Date: Sun, 19 May 2024 22:44:00 GMT
- Title: Your Transformer is Secretly Linear
- Authors: Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov
- Abstract summary: We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship.
We show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance.
In our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization, aimed at reducing layer linearity.
- Score: 7.935853865895353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
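To make the abstract's core measurement concrete, below is a minimal sketch (not the authors' code) of one plausible way to score how linear the map between consecutive layers' embeddings is, plus a hypothetical cosine-similarity penalty in the spirit of the regularization mentioned above. The function names, the exact centering/normalization, and the precise form of the penalty are assumptions, not details confirmed by the paper.

```python
# Sketch under stated assumptions: embeddings X, Y are [n_tokens, d] hidden
# states taken from layers l and l+1 of a decoder-only transformer.
import torch
import torch.nn.functional as F


def linearity_score(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Return a value in [0, 1]; values near 1 mean Y is (almost) a linear function of X."""
    # Center and Frobenius-normalize so the score is translation- and scale-invariant.
    Xc = X - X.mean(dim=0, keepdim=True)
    Yc = Y - Y.mean(dim=0, keepdim=True)
    Xc = Xc / Xc.norm()
    Yc = Yc / Yc.norm()
    # Best linear map A minimizing ||Xc @ A - Yc||_F, found by least squares.
    A = torch.linalg.lstsq(Xc, Yc).solution
    residual = (Xc @ A - Yc).norm() ** 2
    return float(1.0 - residual)  # high score => near-linear layer transformation


def cosine_linearity_penalty(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Hypothetical regularizer: penalize consecutive-layer embeddings for being
    too aligned, as a stand-in for the cosine-similarity-based term in the abstract."""
    cos = F.cosine_similarity(X, Y, dim=-1)  # per-token cosine similarity, shape [n_tokens]
    return cos.mean()  # would be scaled and added to the pretraining loss
```

As a usage note, one would compute `linearity_score` for every pair of adjacent layers over a batch of token embeddings and, per the abstract, expect scores close to 0.99 when the residual stream is included.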
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence on the regression loss despite the non-convexity of the landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z) - Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective [32.01426831450348]
The two-stage fine-tuning (FT) method, linear probing (LP) then fine-tuning (LP-FT), outperforms linear probing and FT alone.
We analyze the training dynamics of LP-FT for classification tasks based on neural tangent kernel (NTK) theory.
Our study demonstrates the effectiveness of LP-FT for fine-tuning language models.
arXiv Detail & Related papers (2024-05-27T01:31:40Z) - Linear Transformers are Versatile In-Context Learners [19.988368693379087]
We prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem.
We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise.
Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm.
arXiv Detail & Related papers (2024-02-21T23:45:57Z) - Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers.
arXiv Detail & Related papers (2023-07-16T18:31:25Z) - The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z) - Memory-Efficient Backpropagation through Large Linear Layers [107.20037639738433]
In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass.
This study proposes a memory reduction approach to perform backpropagation through linear layers.
arXiv Detail & Related papers (2022-01-31T13:02:41Z) - LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences.