Breaking the Low-Rank Dilemma of Linear Attention
- URL: http://arxiv.org/abs/2411.07635v3
- Date: Sun, 17 Nov 2024 12:56:16 GMT
- Title: Breaking the Low-Rank Dilemma of Linear Attention
- Authors: Qihang Fan, Huaibo Huang, Ran He
- Abstract summary: Linear attention provides a far more efficient solution by reducing the complexity to linear levels.
Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map.
We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
- Score: 61.55583836370135
- Abstract: The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA. Code will be available at https://github.com/qhfan/RALA.
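To make the complexity argument concrete, here is a minimal sketch of generic kernelized linear attention (not RALA itself, whose design the abstract only summarizes): the N x N attention map is never materialized, and all token interaction flows through a d x d KV buffer whose rank is at most the head dimension d. The elu(x) + 1 feature map and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: phi(q) @ (phi(k)^T @ v) costs O(N * d^2),
    # versus O(N^2 * d) for softmax attention. phi = elu + 1 is one
    # common choice (Katharopoulos et al.), not RALA's feature map.
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = k.transpose(-2, -1) @ v                            # d x d "KV buffer"
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # normalizer
    return (q @ kv) / (z + eps)

N, d = 196, 64                     # e.g. 14 x 14 tokens with head dim 64
q, k, v = (torch.randn(1, N, d) for _ in range(3))
out = linear_attention(q, k, v)
# The implicit attention map phi(q) @ phi(k)^T has rank <= d << N,
# which is the "low-rank dilemma" the paper analyzes.
print(out.shape)                   # torch.Size([1, 196, 64])
```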
Related papers
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry [24.198536617002667]
Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length.
We propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity.
arXiv Detail & Related papers (2024-02-06T19:31:26Z)
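As a rough sketch of the Hedgehog idea under loose assumptions: a learnable feature map built from a linear projection followed by an elementwise exponential (hence spiky and monotonic), trained with a cross-entropy-style distillation loss so that the induced linear-attention weights mimic softmax attention. The exact feature map and training objective in the paper may differ.

```python
import torch
import torch.nn as nn

class SpikyFeatureMap(nn.Module):
    """Learnable feature map phi(x) = exp(x W): positive, spiky, monotonic."""
    def __init__(self, dim: int, feat_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, feat_dim, bias=False)

    def forward(self, x):
        return torch.exp(self.proj(x))

def mimicry_loss(phi, q, k):
    # Train phi so the linear-attention weights match softmax attention.
    target = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    fq, fk = phi(q), phi(k)
    scores = fq @ fk.transpose(-2, -1)                 # all entries positive
    pred = scores / scores.sum(dim=-1, keepdim=True)   # row-normalize
    return -(target * pred.clamp_min(1e-6).log()).sum(dim=-1).mean()

phi = SpikyFeatureMap(dim=64)
q, k = torch.randn(1, 196, 64), torch.randn(1, 196, 64)
loss = mimicry_loss(phi, q, k)
loss.backward()   # phi.proj receives gradients from the mimicry objective
```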
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
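A reference sketch of the gated linear attention recurrence as commonly formulated: a d x d state is decayed by data-dependent gates before each key-value outer product is accumulated. The paper's contribution is a hardware-efficient chunked parallel form; this sequential loop only illustrates the math, and the sigmoid gate parameterization is an assumption.

```python
import torch
import torch.nn as nn

def gated_linear_attention(q, k, v, alpha):
    # q, k, v: (N, d); alpha: (N, d) forget gates in (0, 1).
    # State update: S_t = diag(alpha_t) @ S_{t-1} + k_t v_t^T, o_t = q_t @ S_t.
    N, d = q.shape
    S = torch.zeros(d, d)
    out = torch.empty_like(v)
    for t in range(N):
        S = alpha[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

N, d = 16, 8
x = torch.randn(N, d)
alpha = torch.sigmoid(nn.Linear(d, d)(x))      # data-dependent gates
out = gated_linear_attention(torch.randn(N, d), torch.randn(N, d),
                             torch.randn(N, d), alpha)
print(out.shape)                               # torch.Size([16, 8])
```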
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
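A hedged sketch of the focused-linear-attention idea: a feature map that preserves each token's norm while sharpening its direction by raising nonnegative features to a power, plus a depthwise convolution on V that restores feature diversity lost to the low-rank attention map. The power p, the normalization, and the module placement here are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def focused_map(x, p=3, eps=1e-6):
    # Keep each token's norm but sharpen its direction by raising
    # nonnegative features to the power p (assumed form).
    x = F.relu(x)
    xp = x ** p
    return xp * (x.norm(dim=-1, keepdim=True) /
                 xp.norm(dim=-1, keepdim=True).clamp_min(eps))

def focused_linear_attention(q, k, v, dwc):
    q, k = focused_map(q), focused_map(k)
    kv = k.transpose(-2, -1) @ v
    z = (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)).clamp_min(1e-6)
    out = (q @ kv) / z
    # Depthwise conv over the (assumed square) token grid adds back
    # feature diversity that the low-rank linear-attention map loses.
    B, N, d = v.shape
    H = W = int(N ** 0.5)
    v2d = v.transpose(1, 2).reshape(B, d, H, W)
    return out + dwc(v2d).reshape(B, d, N).transpose(1, 2)

B, N, d = 1, 196, 64
dwc = torch.nn.Conv2d(d, d, kernel_size=3, padding=1, groups=d)
out = focused_linear_attention(torch.randn(B, N, d), torch.randn(B, N, d),
                               torch.randn(B, N, d), dwc)
print(out.shape)               # torch.Size([1, 196, 64])
```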
- The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z)
- Linear Video Transformer with Feature Fixation [34.324346469406926]
Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
arXiv Detail & Related papers (2022-10-15T02:20:50Z)
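The summary states only that query and key features are reweighted before linear attention, so the sketch below assumes the simplest plausible form, a learned input-dependent channel gate; the paper's actual fixation module likely aggregates more context.

```python
import torch
import torch.nn as nn

class FeatureFixation(nn.Module):
    """Sketch of feature fixation: reweight query/key channels with a
    learned, input-dependent gate before linear attention (assumed form)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        return x * torch.sigmoid(self.gate(x))  # emphasize informative channels

fix_q, fix_k = FeatureFixation(64), FeatureFixation(64)
q, k = torch.randn(1, 196, 64), torch.randn(1, 196, 64)
q, k = fix_q(q), fix_k(k)   # then feed q, k into any linear attention
```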
- Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias into vision transformers while keeping linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce this complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can drop sharply.
We propose a linear transformer, cosFormer, that achieves accuracy comparable to or better than the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
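A minimal sketch of the cosFormer mechanism: ReLU feature maps combined with a cosine positional reweighting, kept linear by decomposing cos(a - b) into cos/sin products. Treating the sequence length N as the reweighting scale M is an assumption.

```python
import math
import torch

def cosformer_attention(q, k, v, eps=1e-6):
    # Pairwise weight: relu(q_i) . relu(k_j) * cos(pi * (i - j) / (2N)),
    # expanded via cos(a - b) = cos(a)cos(b) + sin(a)sin(b) so each term
    # factorizes and the computation stays linear in N.
    B, N, d = q.shape
    q, k = torch.relu(q), torch.relu(k)
    idx = torch.arange(N, dtype=q.dtype) * math.pi / (2 * N)   # position angles
    qc, qs = q * torch.cos(idx)[None, :, None], q * torch.sin(idx)[None, :, None]
    kc, ks = k * torch.cos(idx)[None, :, None], k * torch.sin(idx)[None, :, None]
    num = qc @ (kc.transpose(-2, -1) @ v) + qs @ (ks.transpose(-2, -1) @ v)
    den = (qc @ kc.sum(-2, keepdim=True).transpose(-2, -1)
           + qs @ ks.sum(-2, keepdim=True).transpose(-2, -1))
    return num / den.clamp_min(eps)

out = cosformer_attention(torch.randn(1, 196, 64), torch.randn(1, 196, 64),
                          torch.randn(1, 196, 64))
print(out.shape)    # torch.Size([1, 196, 64])
```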
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.