Higher-order Linear Attention
- URL: http://arxiv.org/abs/2510.27258v1
- Date: Fri, 31 Oct 2025 07:54:37 GMT
- Title: Higher-order Linear Attention
- Authors: Yifan Zhang, Zhen Qin, Quanquan Gu
- Abstract summary: The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics.
- Score: 59.92962330635185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
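The abstract's exact second-order sufficient statistics are not reproduced here. As a minimal NumPy sketch of the general pattern it describes — a constant-size prefix state, linear-time per-token outputs with no $n \times n$ matrix, and a chunked form that reproduces the serial recurrence exactly — here is the standard *first-order* causal linear attention recurrence (a baseline, not HLA itself; all function names and the chunk size are illustrative):

```python
import numpy as np

def linear_attention_stream(Q, K, V):
    """First-order causal linear attention as a streaming recurrence.

    Maintains the constant-size prefix statistic S_t = sum_{i<=t} k_i v_i^T
    and emits o_t = S_t^T q_t: O(n d^2) total, no n x n matrix materialized.
    """
    n, d = Q.shape
    dv = V.shape[1]
    S = np.zeros((d, dv))            # prefix sufficient statistic
    out = np.empty((n, dv))
    for t in range(n):
        S += np.outer(K[t], V[t])    # rank-1 state update
        out[t] = S.T @ Q[t]          # per-token output in O(d * dv)
    return out

def linear_attention_chunked(Q, K, V, chunk=4):
    """Chunk-parallel form: quadratic attention inside each chunk, a single
    carried state across chunks (an associative combine), matching the
    serial recurrence exactly."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.empty((n, V.shape[1]))
    for s in range(0, n, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        intra = np.tril(q @ k.T) @ v   # causal mixing within the chunk
        inter = q @ S                  # contribution of all earlier chunks
        out[s:s+chunk] = intra + inter
        S += k.T @ v                   # fold this chunk into the state
    return out

def masked_quadratic(Q, K, V):
    """Reference: the same map written with an explicit causal mask."""
    return np.tril(Q @ K.T) @ V
```

Both streaming variants agree with the masked quadratic form to machine precision, which is the sense in which the chunked scheme "reproduces the activations of a serial recurrence exactly"; HLA's contribution is extending this template to higher-order interactions with additional compact summaries.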
Related papers
- SLA2: Sparse-Linear Attention with Learnable Routing and QAT [86.22100800353991]
Experiments show that SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
arXiv Detail & Related papers (2026-02-13T07:16:02Z) - Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking [7.437238821092346]
State-space language models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers. We address the limitations of such models by reframing sequence modelling through a probabilistic lens. We introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference.
arXiv Detail & Related papers (2026-02-11T11:11:45Z) - LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - Spectral-Window Hybrid (SWH) [0.0]
Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. We propose the Spectral-Window Hybrid (SWH), an architecture that decouples sequence modeling into two parallel streams. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences.
arXiv Detail & Related papers (2026-01-04T00:31:36Z) - LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport [21.50165411149415]
We propose a principled attention mechanism that is simultaneously linear-time and doubly stochastic. LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark.
arXiv Detail & Related papers (2025-09-27T18:11:09Z) - Log-Linear Attention [81.09631871212211]
This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants.
arXiv Detail & Related papers (2025-06-05T08:44:51Z) - Unifying Autoregressive and Diffusion-Based Sequence Generation [3.1853022872760186]
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes.
arXiv Detail & Related papers (2025-04-08T20:32:10Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Highly Parallel Autoregressive Entity Linking with Discriminative Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions.
Our model is >70 times faster and more accurate than the previous generative method.
arXiv Detail & Related papers (2021-09-08T17:28:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.