Related papers: The Devil in Linear Transformer

The Devil in Linear Transformer

URL: http://arxiv.org/abs/2210.10340v1
Date: Wed, 19 Oct 2022 07:15:35 GMT
Title: The Devil in Linear Transformer
Authors: Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes and Yiran Zhong
Abstract summary: Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. They usually suffer from degraded performances on various tasks and corpus. In this paper, we identify two key issues that lead to such performance gaps.
Score: 42.232886799710215
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performances on various tasks and corpus. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil in unbounded gradients, which turns out unnecessary in linear attention as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage a diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, transNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer .

Related papers

Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. We prove that linear attention is not injective, which is prone to assign identical attention weights to different query vectors. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z)
Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map. We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
arXiv Detail & Related papers (2024-11-12T08:30:59Z)
Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
Your Transformer is Secretly Linear [7.935853865895353]
We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship. We show that removing or linearly approximating some of the most linear blocks of transformers does not affect significantly the loss or model performance. In our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization, aimed at reducing layer linearity.
arXiv Detail & Related papers (2024-05-19T22:44:00Z)
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry [24.198536617002667]
Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. We propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity.
arXiv Detail & Related papers (2024-02-06T19:31:26Z)
Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity. Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead. We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity. Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.