The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax
Mimicry
- URL: http://arxiv.org/abs/2402.04347v1
- Date: Tue, 6 Feb 2024 19:31:26 GMT
- Title: The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax
Mimicry
- Authors: Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré
- Abstract summary: Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length.
We propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity.
- Score: 24.198536617002667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear attentions have shown potential for improving Transformer efficiency,
reducing attention's quadratic complexity to linear in sequence length. This
holds exciting promise for (1) training linear Transformers from scratch, (2)
"finetuned-conversion" of task-specific Transformers into linear versions that
recover task performance, and (3) "pretrained-conversion" of Transformers such
as large language models into linear versions finetunable on downstream tasks.
However, linear attentions often underperform standard softmax attention in
quality. To close this performance gap, we find prior linear attentions lack
key properties of softmax attention tied to good performance: low-entropy (or
"spiky") weights and dot-product monotonicity. We further observe surprisingly
simple feature maps that retain these properties and match softmax performance,
but are inefficient to compute in linear attention. We thus propose Hedgehog, a
learnable linear attention that retains the spiky and monotonic properties of
softmax attention while maintaining linear complexity. Hedgehog uses simple
trainable MLPs to produce attention weights mimicking softmax attention.
Experiments show Hedgehog recovers over 99% of standard Transformer quality in
train-from-scratch and finetuned-conversion settings, outperforming prior
linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs,
and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also
enables pretrained-conversion. Converting a pretrained GPT-2 into a linear
attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for
125M subquadratic decoder models. We finally turn a pretrained Llama-2 7B into
a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B
achieves ROUGE-1 scores 28.1 points higher than the base standard-attention
model, whereas prior linear attentions lead to 16.5-point drops.
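
To make the linear-complexity claim concrete, below is a minimal PyTorch sketch of linear attention with a learnable feature map, in the spirit of the "simple trainable MLPs" described above. The specific feature map (a single linear projection followed by elementwise exponentials of both signs) and all module and function names are illustrative assumptions, not the paper's exact Hedgehog architecture, and the softmax-mimicking training of the feature maps is not shown.

```python
import torch
import torch.nn as nn


class LearnableFeatureMap(nn.Module):
    """Hypothetical learnable feature map: linear projection + elementwise exp."""

    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Exponentials keep features positive and "spiky", loosely mirroring the
        # low-entropy weights of softmax attention (an assumption of this sketch).
        z = self.proj(x)
        return torch.cat([torch.exp(z), torch.exp(-z)], dim=-1)


def linear_attention(q, k, v, phi_q, phi_k, eps: float = 1e-6):
    """Non-causal linear attention in O(N * D_f * D_v) instead of O(N^2 * d)."""
    q_f, k_f = phi_q(q), phi_k(k)                   # (B, N, D_f)
    kv = torch.einsum("bnf,bnd->bfd", k_f, v)       # (B, D_f, D_v), shared by all queries
    z = k_f.sum(dim=1)                              # (B, D_f), normalizer
    num = torch.einsum("bnf,bfd->bnd", q_f, kv)     # (B, N, D_v)
    den = torch.einsum("bnf,bf->bn", q_f, z).clamp_min(eps)
    return num / den.unsqueeze(-1)


if __name__ == "__main__":
    B, N, d = 2, 128, 32
    q, k, v = (torch.randn(B, N, d) for _ in range(3))
    phi_q, phi_k = LearnableFeatureMap(d), LearnableFeatureMap(d)
    print(linear_attention(q, k, v, phi_q, phi_k).shape)  # torch.Size([2, 128, 32])
```

The key point is that the aggregated key-value term is computed once and reused for every query, so cost grows linearly in sequence length N rather than quadratically as with the full softmax attention matrix.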
Related papers
- Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels.
Our experiments indicate that linear attention's performance drop relative to softmax attention stems from the low-rank nature of its feature map.
We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
arXiv Detail & Related papers (2024-11-12T08:30:59Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer performs competitively (see the recurrence sketch after this list).
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
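
Several of the entries above (the delta-rule and gated linear attention papers in particular) modify the causal, recurrent form of linear attention. As a hedged illustration only, the sketch below shows that standard recurrence; the gated and delta-rule state updates appear in comments as paraphrases, not those papers' exact algorithms.

```python
import torch


def causal_linear_attention(q_f, k_f, v, eps: float = 1e-6):
    """Recurrent causal linear attention.

    q_f, k_f: (B, N, D_f) feature-mapped queries/keys; v: (B, N, D_v).
    State is a single (D_f, D_v) matrix per sequence, so memory is constant in N
    and time is O(N * D_f * D_v) rather than O(N^2).
    """
    B, N, D_f = k_f.shape
    D_v = v.shape[-1]
    S = torch.zeros(B, D_f, D_v)   # running sum of outer products phi(k_t) v_t^T
    z = torch.zeros(B, D_f)        # running sum of phi(k_t), used for normalization
    outs = []
    for t in range(N):
        qt, kt, vt = q_f[:, t], k_f[:, t], v[:, t]
        S = S + kt.unsqueeze(-1) * vt.unsqueeze(1)   # plain linear-attention update
        # Gated variant (paraphrase, assumption): S = g_t[..., None] * S + outer(kt, vt),
        # with a data-dependent gate g_t in (0, 1) decaying the old state.
        # Delta-rule variant (paraphrase, assumption): first read v_hat for key kt from S,
        # then S = S + beta_t * outer(kt, vt - v_hat), correcting toward vt
        # instead of purely accumulating.
        z = z + kt
        num = torch.einsum("bf,bfd->bd", qt, S)
        den = torch.einsum("bf,bf->b", qt, z).clamp_min(eps)
        outs.append(num / den.unsqueeze(-1))
    return torch.stack(outs, dim=1)                  # (B, N, D_v)
```

In practice these recurrences are computed with chunked, hardware-efficient parallel forms rather than a Python loop; the loop here is only for clarity.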
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.