The Devil in Linear Transformer
- URL: http://arxiv.org/abs/2210.10340v1
- Date: Wed, 19 Oct 2022 07:15:35 GMT
- Title: The Devil in Linear Transformer
- Authors: Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes and Yiran Zhong
- Abstract summary: Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
However, they usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
- Score: 42.232886799710215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear transformers aim to reduce the quadratic space-time complexity of
vanilla transformers. However, they usually suffer from degraded performance
on various tasks and corpora. In this paper, we examine existing kernel-based
linear transformers and identify two key issues that lead to such performance
gaps: 1) unbounded gradients in the attention computation adversely impact the
convergence of linear transformer models; 2) attention dilution which trivially
distributes attention scores over long sequences while neglecting neighbouring
structures. To address these issues, we first identify that the scaling of
attention matrices is the devil in unbounded gradients, which turns out to be
unnecessary in linear attention as we show theoretically and empirically. To
this end, we propose a new linear attention that replaces the scaling operation
with a normalization to stabilize gradients. For the issue of attention
dilution, we leverage a diagonal attention to confine attention to only
neighbouring tokens in early layers. Benefiting from the stable gradients and
improved attention, our new linear transformer model, transNormer, demonstrates
superior performance on text classification and language modeling tasks, as
well as on the challenging Long-Range Arena benchmark, surpassing vanilla
transformer and existing linear variants by a clear margin while being
significantly more space-time efficient. The code is available at
https://github.com/OpenNLPLab/Transnormer .
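The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). The abstract does not specify the kernel feature map or the normalization, so the `elu(x) + 1` map, the RMS normalization, the block-wise softmax used for the diagonal attention, and all function names here are assumptions chosen for illustration.

```python
import numpy as np

def feature_map(x):
    # Assumed kernel feature map: elu(x) + 1, which keeps features positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def scaled_linear_attention(q, k, v, eps=1e-6):
    # Conventional kernel linear attention: each output row is divided by
    # the row sum of the implicit attention matrix (the "scaling" step).
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                    # (d, d_v), costs O(n * d * d_v)
    z = q @ k.sum(axis=0)           # (n,), row sums of phi(q) phi(k)^T
    return (q @ kv) / (z[:, None] + eps)

def rms_norm(x, eps=1e-6):
    # Assumed normalization: rescale each row to unit root-mean-square.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def norm_linear_attention(q, k, v):
    # Sketch of the abstract's idea: drop the scaling step and
    # normalize the attention output instead.
    q, k = feature_map(q), feature_map(k)
    return rms_norm(q @ (k.T @ v))

def block_diagonal_attention(q, k, v, block_size):
    # Sketch of diagonal attention: softmax attention restricted to
    # non-overlapping blocks, so tokens only attend to neighbours.
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        scores = q[start:end] @ k[start:end].T / np.sqrt(d)
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[start:end]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 3))
print(norm_linear_attention(q, k, v).shape)
print(block_diagonal_attention(q, k, v, block_size=4).shape)
```

Both linear variants compute `k.T @ v` first, which avoids forming the n-by-n attention matrix. With a fixed block size, the block-diagonal variant also scales linearly in sequence length.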
Related papers
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
We also experiment with two hybrid models which combine DeltaNet layers with sliding-window attention layers every other layer or two global attention layers.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Your Transformer is Secretly Linear [7.935853865895353]
We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship.
We show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance.
In our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization, aimed at reducing layer linearity.
arXiv Detail & Related papers (2024-05-19T22:44:00Z)
- Latent Attention for Linear Time Transformers [8.640180203900583]
"Latte Transformer" model can be implemented for both bidirectional and unidirectional tasks.
arXiv Detail & Related papers (2024-02-27T13:54:48Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.