Superlinear Multi-Step Attention
- URL: http://arxiv.org/abs/2601.18401v1
- Date: Mon, 26 Jan 2026 11:58:42 GMT
- Title: Superlinear Multi-Step Attention
- Authors: Yufeng Huang
- Abstract summary: Superlinear attention reformulates standard causal self-attention as a multi-step search problem with $N$ steps.
- Score: 8.782622621289251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose \textbf{Superlinear attention}, a fully trainable multi-step attention architecture that achieves subquadratic complexity for long sequences while preserving \textbf{random context access} (a.k.a.\ structural non-exclusion): no eligible token position is structurally excluded from being selected for attention. Superlinear attention reformulates standard causal self-attention as a multi-step search problem with $N$ steps, yielding an overall complexity of $O(L^{1+\frac{1}{N}})$. To illustrate the architecture, we present a baseline $N=2$ implementation, which is algorithmically analogous to standard jump search. In this $O(L^{3/2})$ instantiation, the first step performs $O(L^{3/2})$ span-search to select relevant spans of the sequence, and the second step applies $O(L^{3/2})$ span-attention (standard attention restricted to the selected spans). In an upscaled $O(L^{1.54})$ configuration for robustness, we achieve an average decoding throughput of 114 tokens/sec at 1M context length and 80 tokens/sec at 10M context in our implementation on a modified 30B hybrid MoE model on a single B200 GPU. With limited training, we also obtain strong performance on the NIAH (Needle In A Haystack) task up to 256K context length, demonstrating that the routed span selection is learnable end-to-end. This paper emphasizes architectural formulation, scaling analysis, and systems feasibility, and presents initial validation; comprehensive quality evaluations across diverse long-context tasks are left to future work.
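To make the two-step decomposition concrete, here is a minimal single-query NumPy sketch of the $N=2$ instantiation described above. The span size $\lceil\sqrt{L}\rceil$ follows the abstract's complexity argument, but the mean-pooled span keys and hard top-k routing are our illustrative assumptions; the paper's span-search step is a trained, end-to-end routed component.

```python
import numpy as np

def two_step_span_attention(q, K, V, top_k=4):
    """Single-query sketch of the N=2 instantiation: span-search then
    span-attention. Mean-pooled span keys and hard top-k routing are
    illustrative assumptions, not the paper's trained span selector.

    q: (d,) query; K, V: (L, d) cached keys/values.
    """
    L, d = K.shape
    S = int(np.ceil(np.sqrt(L)))                      # span size ~ sqrt(L)
    n_spans = int(np.ceil(L / S))

    # Step 1: span-search -- score ~sqrt(L) spans via mean-pooled keys.
    span_keys = np.stack([K[i*S:(i+1)*S].mean(axis=0) for i in range(n_spans)])
    chosen = np.argsort(span_keys @ q)[-top_k:]       # top-k spans

    # Step 2: span-attention -- standard attention over selected spans.
    idx = np.concatenate([np.arange(i*S, min((i+1)*S, L)) for i in chosen])
    logits = K[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx]

# Per query: O(sqrt(L)) span scores + O(top_k * sqrt(L)) attention, so L
# decoding steps give the O(L^{3/2}) total stated in the abstract.
```

Note that no token position is structurally excluded: every span, and hence every token, remains eligible for selection at each step, which is the random-context-access property the abstract emphasizes.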
Related papers
- Accelerated Evolving Set Processes for Local PageRank Computation [75.54334100808022]
This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank computation. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2),\, \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector.
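For context, localized PPR methods of the kind bounded above typically build on the classic local push routine; a minimal sketch follows, with the standard residual-threshold rule as our assumed baseline rather than the paper's nested evolving set process.

```python
def ppr_push(graph, seed, alpha=0.15, eps=1e-6):
    """Classic local push for an eps-approximate Personalized PageRank
    vector (the localized baseline such frameworks accelerate; shown for
    context, not the paper's method). Assumes every node has a neighbor.

    graph: dict mapping node -> list of neighbors.
    """
    p, r = {}, {seed: 1.0}      # settled mass p, residual mass r
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r.get(u, 0.0) < eps * deg:
            continue                                 # below push threshold
        ru = r.pop(u)
        p[u] = p.get(u, 0.0) + alpha * ru            # settle mass at u
        push = (1 - alpha) * ru / deg                # spread the rest
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + push
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
    return p  # approximate PPR scores; support stays local to the seed
```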
arXiv Detail & Related papers (2025-10-09T09:47:40Z)
- ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models [4.951427498576812]
We introduce a novel context compression pipeline called Intermediate Layer Retrieval (ILRe). ILRe encodes context by streaming chunked prefill only up to a chosen intermediate layer, and recalls tokens by the attention scores between the input query and the full key cache at that layer. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute.
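A rough sketch of the recall step as the summary describes it, with max-pooling over query positions and a hard top-$k$ cut as our own reading rather than the released implementation:

```python
import numpy as np

def recall_tokens(query_states, key_cache, k=512):
    """Rank all cached token positions at one intermediate layer by their
    attention score against the input query; keep the top-k positions for
    the compressed context. (Our reading of the summary, not the code.)

    query_states: (n_q, d) query vectors at the retrieval layer.
    key_cache:    (L, d) full key cache at that same layer.
    """
    d = key_cache.shape[1]
    scores = query_states @ key_cache.T / np.sqrt(d)   # (n_q, L)
    pooled = scores.max(axis=0)                        # best score per token
    keep = np.sort(np.argsort(pooled)[-k:])            # top-k, original order
    return keep  # token positions to retain downstream
```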
arXiv Detail & Related papers (2025-08-25T10:59:02Z)
- From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze forgetting, i.e., the loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate upper bounds in the realizable least squares setup, and prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z)
- Uncovering Untapped Potential in Sample-Efficient World Model Agents [51.65485693709418]
Simulus is a highly modular TBWM agent that integrates a multi-modality tokenization framework, intrinsic motivation, prioritized WM replay, and regression-as-classification. Simulus achieves state-of-the-art sample efficiency for planning-free WMs across three diverse benchmarks.
arXiv Detail & Related papers (2025-02-17T08:06:10Z)
- ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
We propose ZETA to enable parallel querying of past tokens for entire sequences. ZETA matches the performance of standard attention on the synthetic \textsc{Multi-Query Associative Recall} task.
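ZETA's efficiency rests on Z-order (Morton) curves, which interleave the bits of quantized key coordinates so that spatial neighbors tend to land near each other on a sorted 1-D sequence; a minimal encoding sketch follows (omitting the dimensionality reduction ZETA applies before this step).

```python
def morton_code(coords, bits=10):
    """Z-order (Morton) code: interleave the bits of quantized coordinates
    so points close in space tend to be close on the 1-D curve. Sorting
    keys by this code turns top-k neighbor queries into searches over a
    sorted sequence (standard encoding, not ZETA's full pipeline).
    """
    code = 0
    dim = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * dim + d)
    return code

# Keys quantized to a small grid, then sorted by Morton code:
keys = [(3, 7), (3, 6), (12, 1), (4, 7)]
print(sorted(keys, key=morton_code))
# -> [(3, 6), (3, 7), (4, 7), (12, 1)]: nearby points cluster together
```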
arXiv Detail & Related papers (2025-01-24T15:33:05Z)
- Fine-grained Analysis and Faster Algorithms for Iteratively Solving Linear Systems [9.30306458153248]
Despite being a key bottleneck in many machine learning tasks, the cost of solving large linear systems has proven challenging to quantify. We consider a fine-grained notion of complexity for solving linear systems, motivated by applications where the data exhibits low-dimensional structure. We give an algorithm based on the Sketch-and-Project paradigm that solves the linear system $Ax = b$, that is, finds $\bar{x}$ such that $\|A\bar{x} - b\| \le \epsilon \|b\|$, in time
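The time bound above is truncated in this summary. For context, a minimal sketch-and-project loop that stops at the quoted residual criterion $\|A\bar{x}-b\| \le \epsilon\|b\|$, with a Gaussian sketch of fixed size as our illustrative choice:

```python
import numpy as np

def sketch_and_project(A, b, sketch_size=16, eps=1e-6, max_iters=10000, seed=0):
    """Sketch-and-project for Ax = b: each step projects the iterate onto
    the solution set of a sketched system S^T A x = S^T b, stopping once
    ||Ax - b|| <= eps * ||b|| (the guarantee quoted in the summary).
    The Gaussian sketch and fixed sketch size are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(max_iters):
        r = A @ x - b
        if np.linalg.norm(r) <= eps * np.linalg.norm(b):
            break
        S = rng.standard_normal((m, sketch_size))     # random sketch
        As = S.T @ A                                  # sketched system
        # Least-norm correction projecting x onto {z : As z = S^T b};
        # note As x - S^T b = S^T r.
        x -= As.T @ np.linalg.lstsq(As @ As.T, S.T @ r, rcond=None)[0]
    return x
```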
arXiv Detail & Related papers (2024-05-09T14:56:49Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [67.66904892192794]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, along with near-optimal policy switching cost. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- Simplifying and Understanding State Space Models with Diagonal Linear RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs.
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
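A minimal $\mathrm{DLR}$-style recurrence, with the complex diagonal parameterization written out as our illustration of the vanilla diagonal linear RNN named above:

```python
import numpy as np

def dlr_scan(x, lam, B, C):
    """Vanilla diagonal linear RNN recurrence, with no discretization
    step:  h_t = lam * h_{t-1} + B x_t,  y_t = Re(C h_t).

    lam: (n,) complex diagonal state-transition entries (|lam| <= 1).
    B:   (n, d_in) input projection.  C: (d_out, n) readout.
    The exact parameterization here is illustrative, not the paper's.
    """
    n = lam.shape[0]
    h = np.zeros(n, dtype=complex)
    ys = []
    for x_t in x:                    # x: (T, d_in), one step per row
        h = lam * h + B @ x_t        # elementwise diagonal transition
        ys.append((C @ h).real)
    return np.stack(ys)              # (T, d_out)
```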
arXiv Detail & Related papers (2022-12-01T18:53:06Z)
- Quantum Resources Required to Block-Encode a Matrix of Classical Data [56.508135743727934]
We provide circuit-level implementations and resource estimates for several methods of block-encoding a dense $N \times N$ matrix of classical data to precision $\epsilon$.
We examine resource tradeoffs between the different approaches and explore implementations of two separate models of quantum random access memory (QRAM).
Our results go beyond simple query complexity and provide a clear picture into the resource costs when large amounts of classical data are assumed to be accessible to quantum algorithms.
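For reference, a block-encoding in the standard $(\alpha, a, \epsilon)$ sense that these resource estimates concern is a unitary holding $A/\alpha$ in its top-left block:

```latex
% U acts on a ancilla qubits plus the s = log2(N) system qubits, and
% contains A, up to scaling alpha and error epsilon, in its corner block:
\[
  \left\| A - \alpha \left( \langle 0|^{\otimes a} \otimes I \right)
  U \left( |0\rangle^{\otimes a} \otimes I \right) \right\| \le \epsilon,
  \qquad
  U = \begin{pmatrix} A/\alpha & \cdot \\ \cdot & \cdot \end{pmatrix}.
\]
```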
arXiv Detail & Related papers (2022-06-07T18:00:01Z)
- Adaptive Multi-Goal Exploration [118.40427257364729]
We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy.
AdaGoal is anchored in the high-level algorithmic structure of existing methods for goal-conditioned deep reinforcement learning.
arXiv Detail & Related papers (2021-11-23T17:59:50Z)
- An Online Riemannian PCA for Stochastic Canonical Correlation Analysis [37.8212762083567]
We present an efficient algorithm (RSG+) for canonical correlation analysis (CCA) using a reparametrization of the projection matrices.
While the paper primarily focuses on the formulation and technical analysis of its properties, our experiments show that the empirical behavior on common datasets is quite promising.
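For context, the population CCA objective that such streaming methods optimize, with projection vectors constrained to covariance-induced unit spheres (the standard formulation, not this paper's reparametrization):

```latex
\[
  \max_{u,\, v}\; u^{\top} \Sigma_{xy}\, v
  \quad \text{s.t.} \quad
  u^{\top} \Sigma_{xx}\, u = 1, \qquad v^{\top} \Sigma_{yy}\, v = 1.
\]
% \Sigma_{xx}, \Sigma_{yy}: covariances of the two views;
% \Sigma_{xy}: their cross-covariance. RSG+ reparametrizes the
% projection matrices and optimizes a streaming form of this objective.
```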
arXiv Detail & Related papers (2021-06-08T23:38:29Z)