Transformer Based Linear Attention with Optimized GPU Kernel Implementation
- URL: http://arxiv.org/abs/2510.21956v1
- Date: Fri, 24 Oct 2025 18:32:20 GMT
- Title: Transformer Based Linear Attention with Optimized GPU Kernel Implementation
- Authors: Armin Gerami, Ramani Duraiswami
- Abstract summary: Linear attention (LA) mechanisms offer a linear time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to regular attention. We propose a novel method for LA's forward and backward passes, along with a highly optimized implementation. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model.
- Score: 10.235738752130803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of linear attention (LA) mechanisms, which offer a linear time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates similar expressivity to regular attention on major reasoning benchmarks.
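A minimal NumPy sketch of the generic linear-attention idea the abstract refers to, not the authors' optimized CUDA kernels: a positive feature map phi replaces the softmax kernel, so the $D \times D$ summary $\phi(K)^\top V$ is accumulated once and reused for every query, giving $O(ND^2)$ instead of $O(N^2D)$. The ELU+1 feature map and the non-causal form are illustrative assumptions, not details from the paper.

```python
import numpy as np

def phi(x):
    # Positive feature map (ELU + 1), a common stand-in for the softmax kernel.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K, V: (N, D) arrays for one head. Returns (N, D) outputs in O(N * D^2)."""
    Qf, Kf = phi(Q), phi(K)      # (N, D) feature-mapped queries and keys
    S = Kf.T @ V                 # (D, D) key-value summary, built once
    z = Kf.sum(axis=0)           # (D,)  normalizer accumulator
    num = Qf @ S                 # (N, D) unnormalized outputs
    den = Qf @ z                 # (N,)  per-token normalizers
    return num / den[:, None]

# Tiny usage example on random data.
rng = np.random.default_rng(0)
N, D = 8, 4
Q, K, V = rng.normal(size=(3, N, D))
print(linear_attention(Q, K, V).shape)  # (8, 4)
```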
Related papers
- Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - Second-order Optimization of Gaussian Splats with Importance Sampling [51.95046424364725]
3D Gaussian Splatting (3DGS) is widely used for novel view rendering due to its high quality and fast inference time. We propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG). Our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $6\times$ when the Gaussian count is low.
arXiv Detail & Related papers (2025-04-17T12:52:08Z) - Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers [18.469378618426294]
We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
arXiv Detail & Related papers (2025-02-03T19:24:01Z) - SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [34.548270527357126]
We propose SageAttention, a highly efficient and accurate quantization method for attention. Our approach incurs almost no end-to-end metrics loss across diverse models.
arXiv Detail & Related papers (2024-10-03T10:25:23Z) - Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers [16.046186753149]
The self-attention mechanism is the key to the success of transformers in recent Large Language Models (LLMs).
We leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention using convolution matrices.
We hope our new paradigm for accelerating attention computation in transformer models can help their application to longer contexts.
arXiv Detail & Related papers (2024-05-08T17:11:38Z) - Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
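A minimal NumPy sketch of the recurrent view behind gated linear attention: a running key-value state is decayed by a data-dependent gate and updated with the current key/value outer product. The sigmoid gate parameterization below is an illustrative assumption; the paper's chunkwise, hardware-efficient training algorithm is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_attention(Q, K, V, X, Wg):
    """Q, K: (N, Dk); V: (N, Dv); X: (N, Dx) inputs; Wg: (Dx, Dk) gate weights."""
    N, Dk = Q.shape
    Dv = V.shape[1]
    S = np.zeros((Dk, Dv))             # running key-value state
    out = np.empty((N, Dv))
    for t in range(N):
        g = sigmoid(X[t] @ Wg)         # (Dk,) data-dependent decay gate in (0, 1)
        S = g[:, None] * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S              # causal: only positions <= t contribute
    return out

# Tiny usage example on random data.
rng = np.random.default_rng(0)
N, Dk, Dv, Dx = 5, 3, 4, 3
Q, K = rng.normal(size=(N, Dk)), rng.normal(size=(N, Dk))
V, X = rng.normal(size=(N, Dv)), rng.normal(size=(N, Dx))
Wg = rng.normal(size=(Dx, Dk))
print(gated_linear_attention(Q, K, V, X, Wg).shape)  # (5, 4)
```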
arXiv Detail & Related papers (2023-12-11T18:51:59Z) - Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization [2.081667369602538]
We introduce Jorge, a second-order optimizer that promises the best of both worlds: the rapid convergence of second-order methods and the high computational efficiency typical of first-order methods.
We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner.
arXiv Detail & Related papers (2023-10-18T19:58:54Z) - cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce the complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can suffer significant drops.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z) - Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention [5.495006023171481]
Transformer models have revolutionized Natural Language Processing (NLP) and also show promising performance on Computer Vision (CV) tasks.
We propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention.
We demonstrate that Energon achieves $161\times$ and $8.4\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction.
arXiv Detail & Related papers (2021-10-18T13:42:43Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
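A minimal NumPy sketch of the building block this entry relies on: a Toeplitz matrix-vector product computed in $O(N \log N)$ by embedding the Toeplitz matrix in a circulant one and multiplying in the Fourier domain. The RPE-specific attention algebra from the paper is not reproduced here.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply the Toeplitz matrix T (T[i, j] = first_col[i - j] for i >= j,
    first_row[j - i] for j > i) by x via FFT in O(n log n)."""
    n = len(x)
    # Embed T in a 2n x 2n circulant matrix; the padding entry is arbitrary.
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
    fx = np.fft.fft(np.concatenate([x, np.zeros(n)]))
    y = np.fft.ifft(np.fft.fft(c) * fx)[:n]
    return y.real

# Check against the dense product on random data.
rng = np.random.default_rng(0)
n = 6
col, row, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
print(np.allclose(T @ x, toeplitz_matvec(col, row, x)))  # True
```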
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)