The Devil in Linear Transformer
- URL: http://arxiv.org/abs/2210.10340v1
- Date: Wed, 19 Oct 2022 07:15:35 GMT
- Title: The Devil in Linear Transformer
- Authors: Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick
Barnes and Yiran Zhong
- Abstract summary: Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
- Score: 42.232886799710215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear transformers aim to reduce the quadratic space-time complexity of
vanilla transformers. However, they usually suffer from degraded performance
on various tasks and corpora. In this paper, we examine existing kernel-based
linear transformers and identify two key issues that lead to such performance
gaps: 1) unbounded gradients in the attention computation adversely impact the
convergence of linear transformer models; 2) attention dilution, which trivially
distributes attention scores over long sequences while neglecting neighbouring
structures. To address these issues, we first identify that the scaling of
attention matrices is the devil in unbounded gradients, which turns out to be
unnecessary in linear attention, as we show theoretically and empirically. To
this end, we propose a new linear attention that replaces the scaling operation
with a normalization to stabilize gradients. For the issue of attention
dilution, we leverage a diagonal attention to confine attention to only
neighbouring tokens in early layers. Benefiting from the stable gradients and
improved attention, our new linear transformer model, TransNormer, demonstrates
superior performance on text classification and language modeling tasks, as
well as on the challenging Long-Range Arena benchmark, surpassing the vanilla
transformer and existing linear variants by a clear margin while being
significantly more space-time efficient. The code is available at
https://github.com/OpenNLPLab/Transnormer.
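For a concrete picture of the two fixes, the sketch below shows kernel-based linear attention with the row-sum scaling replaced by a normalization of the output, plus a block-diagonal attention that confines each token to its local neighbourhood, as proposed for early layers. This is a minimal PyTorch illustration, not the released TransNormer code: the elu+1 feature map, the LayerNorm stand-in, the tensor shapes, and the block size are all assumptions.

```python
# A minimal sketch of the two ideas; not the authors' implementation.
import torch
import torch.nn.functional as F

def norm_linear_attention(q, k, v):
    """Kernel-based linear attention without the row-sum scaling step;
    the output is normalized instead (the gradient-stabilizing idea).
    q, k, v: (batch, heads, seq_len, dim)."""
    phi_q = F.elu(q) + 1                              # positive feature map (assumed)
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", phi_k, v)    # (d_k, d_v) summary of keys/values
    out = torch.einsum("bhnd,bhde->bhne", phi_q, kv)  # linear in sequence length
    return F.layer_norm(out, (out.shape[-1],))        # normalize instead of scaling

def diag_block_attention(q, k, v, block_size=64):
    """Softmax attention restricted to non-overlapping local blocks: a simple
    stand-in for the diagonal attention used in early layers."""
    n, d = q.shape[-2], q.shape[-1]
    scores = torch.einsum("bhid,bhjd->bhij", q, k) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    local = (idx[:, None] // block_size) == (idx[None, :] // block_size)
    scores = scores.masked_fill(~local, float("-inf"))
    return torch.einsum("bhij,bhjd->bhid", scores.softmax(dim=-1), v)

q = k = v = torch.randn(2, 4, 128, 64)
print(norm_linear_attention(q, k, v).shape)   # torch.Size([2, 4, 128, 64])
print(diag_block_attention(q, k, v).shape)    # torch.Size([2, 4, 128, 64])
```

Because phi(K)^T V is computed before multiplying by phi(Q), the cost stays linear in sequence length, and dropping the row-wise division removes the small-denominator terms that the paper identifies as the source of unbounded gradients.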
Related papers
- Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient alternative by reducing the complexity from quadratic to linear.
Our experiments indicate that linear attention's performance drop is due to the low-rank nature of its feature map.
We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
arXiv Detail & Related papers (2024-11-12T08:30:59Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Your Transformer is Secretly Linear [7.935853865895353]
We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship.
We show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance.
In our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization, aimed at reducing layer linearity.
arXiv Detail & Related papers (2024-05-19T22:44:00Z)
- The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry [24.198536617002667]
Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length.
We propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity.
arXiv Detail & Related papers (2024-02-06T19:31:26Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates; a generic recurrent-form sketch of this family of linear attentions is given after the list below.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
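Several of the papers above (the delta-rule and gated linear attention works in particular) build on the recurrent view of linear attention, in which a fixed-size state is updated token by token. The sketch below is illustrative only and is not any of the cited algorithms, which focus on hardware-efficient chunked or parallel training; the shapes and the sigmoid gate parameterization are assumptions.

```python
# A generic recurrent-form sketch of (optionally gated) linear attention.
# Hypothetical shapes and gate parameterization; not the cited works' code.
import torch

def recurrent_linear_attention(q, k, v, gates=None):
    """Causal linear attention as a recurrence over a (d_k, d_v) state.
    q, k: (seq_len, d_k); v: (seq_len, d_v);
    gates: optional (seq_len, d_k) per-dimension decay factors in (0, 1)."""
    n, d_k = q.shape
    d_v = v.shape[-1]
    state = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(n):
        if gates is not None:
            state = gates[t].unsqueeze(-1) * state   # data-dependent decay
        state = state + torch.outer(k[t], v[t])      # rank-1 state update
        outputs.append(q[t] @ state)                 # read out a (d_v,) vector
    return torch.stack(outputs)

q, k, v = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 4)
gates = torch.sigmoid(torch.randn(16, 8))             # toy data-dependent gates
print(recurrent_linear_attention(q, k, v).shape)         # torch.Size([16, 4])
print(recurrent_linear_attention(q, k, v, gates).shape)  # torch.Size([16, 4])
```

The sequential loop makes the linear-time, constant-memory state explicit; the cited works are about computing this kind of recurrence efficiently on parallel hardware.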
This list is automatically generated from the titles and abstracts of the papers on this site.