Higher-order Linear Attention
- URL: http://arxiv.org/abs/2510.27258v1
- Date: Fri, 31 Oct 2025 07:54:37 GMT
- Title: Higher-order Linear Attention
- Authors: Yifan Zhang, Zhen Qin, Quanquan Gu
- Abstract summary: The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics.
- Score: 59.92962330635185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
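The abstract's exact second-order sufficient statistics are not reproduced here. As a minimal NumPy sketch of the general pattern it describes — a constant-size prefix state, linear-time per-token outputs with no $n \times n$ matrix, and a chunked form that reproduces the serial recurrence exactly — here is the standard *first-order* causal linear attention recurrence (a baseline, not HLA itself; all function names and the chunk size are illustrative):

```python
import numpy as np

def linear_attention_stream(Q, K, V):
    """First-order causal linear attention as a streaming recurrence.

    Maintains the constant-size prefix statistic S_t = sum_{i<=t} k_i v_i^T
    and emits o_t = S_t^T q_t: O(n d^2) total, no n x n matrix materialized.
    """
    n, d = Q.shape
    dv = V.shape[1]
    S = np.zeros((d, dv))            # prefix sufficient statistic
    out = np.empty((n, dv))
    for t in range(n):
        S += np.outer(K[t], V[t])    # rank-1 state update
        out[t] = S.T @ Q[t]          # per-token output in O(d * dv)
    return out

def linear_attention_chunked(Q, K, V, chunk=4):
    """Chunk-parallel form: quadratic attention inside each chunk, a single
    carried state across chunks (an associative combine), matching the
    serial recurrence exactly."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.empty((n, V.shape[1]))
    for s in range(0, n, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        intra = np.tril(q @ k.T) @ v   # causal mixing within the chunk
        inter = q @ S                  # contribution of all earlier chunks
        out[s:s+chunk] = intra + inter
        S += k.T @ v                   # fold this chunk into the state
    return out

def masked_quadratic(Q, K, V):
    """Reference: the same map written with an explicit causal mask."""
    return np.tril(Q @ K.T) @ V
```

Both streaming variants agree with the masked quadratic form to machine precision, which is the sense in which the chunked scheme "reproduces the activations of a serial recurrence exactly"; HLA's contribution is extending this template to higher-order interactions with additional compact summaries.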
Related papers
- SLA2: Sparse-Linear Attention with Learnable Routing and QAT [86.22100800353991]
Experiments show that SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
arXiv Detail & Related papers (2026-02-13T07:16:02Z) - Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking [7.437238821092346]
State-space language models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers. We address the limitations of such models by reframing sequence modelling through a probabilistic lens. We introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference.
arXiv Detail & Related papers (2026-02-11T11:11:45Z) - LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - Spectral-Window Hybrid (SWH) [0.0]
Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. We propose the Spectral-Window Hybrid (SWH), an architecture that decouples sequence modeling into two parallel streams. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences.
arXiv Detail & Related papers (2026-01-04T00:31:36Z) - LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport [21.50165411149415]
We propose a principled attention mechanism that is simultaneously linear-time and doubly stochastic. LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark.
arXiv Detail & Related papers (2025-09-27T18:11:09Z) - Log-Linear Attention [81.09631871212211]
This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants.
arXiv Detail & Related papers (2025-06-05T08:44:51Z) - Unifying Autoregressive and Diffusion-Based Sequence Generation [3.1853022872760186]
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes.
arXiv Detail & Related papers (2025-04-08T20:32:10Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Highly Parallel Autoregressive Entity Linking with Discriminative Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions.
Our model is >70 times faster and more accurate than the previous generative method.
arXiv Detail & Related papers (2021-09-08T17:28:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.