Fast attention mechanisms: a tale of parallelism
- URL: http://arxiv.org/abs/2509.09001v1
- Date: Wed, 10 Sep 2025 20:59:44 GMT
- Title: Fast attention mechanisms: a tale of parallelism
- Authors: Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu
- Abstract summary: We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms.
- Score: 52.7657529272906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.
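The abstract does not describe ANNA's construction in detail, but the name suggests restricting each query's attention to approximately nearest keys. A minimal sketch of that idea, using random-hyperplane locality-sensitive hashing (an assumption; the paper's actual mechanism may differ):

```python
import numpy as np

def anna_sketch(Q, K, V, n_bits=4, seed=0):
    """Toy approximate-nearest-neighbor attention: hash queries and keys
    with random hyperplanes (LSH) and restrict each query's attention to
    keys that land in the same bucket. Illustrative only; ANNA's actual
    construction and guarantees are given in the paper."""
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    planes = rng.standard_normal((d, n_bits))
    # Sign pattern of the projections -> integer bucket id per token.
    weights = 1 << np.arange(n_bits)
    q_buckets = ((Q @ planes) > 0) @ weights
    k_buckets = ((K @ planes) > 0) @ weights
    out = np.zeros_like(V)
    for i in range(n):
        mask = k_buckets == q_buckets[i]
        if not mask.any():           # empty bucket: emit zeros for this query
            continue
        scores = (K[mask] @ Q[i]) / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V[mask]
    return out
```

With `n_bits=0` every token falls in one bucket and the sketch reduces to standard softmax attention; with more bits, expected work per query shrinks toward sub-quadratic total cost.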
Related papers
- Hierarchical Shift Mixing -- Beyond Dense Attention in Transformers [0.0]
We introduce HSM, a framework for token mixing that distributes pairwise token interactions across Transformer layers. HSM enables linear-time complexity while remaining agnostic to the specific mixing function. We show that even simple HSM variants achieve performance close to softmax attention.
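The summary does not specify how HSM schedules interactions across layers; one plausible reading (an assumption, not the paper's definition) is a stack of linear-time layers with power-of-two shifts, so any pair of positions interacts within O(log n) layers:

```python
import numpy as np

def shift_mix_layer(x, shift, mix=0.5):
    """One toy shift-mixing layer: each token is blended with the token
    `shift` positions earlier. Each layer costs O(n), and stacking layers
    with shifts 1, 2, 4, ... lets information from any pair of positions
    meet in O(log n) layers. A sketch of the general idea only; HSM's
    actual mixing function may differ."""
    rolled = np.roll(x, shift, axis=0)
    rolled[:shift] = x[:shift]          # no wrap-around at the left edge
    return (1 - mix) * x + mix * rolled

def hsm_stack(x):
    """Apply shift-mixing layers with doubling shifts (hypothetical name)."""
    n = x.shape[0]
    shift = 1
    while shift < n:
        x = shift_mix_layer(x, shift)
        shift *= 2
    return x
```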
arXiv Detail & Related papers (2026-01-30T11:23:14Z)
- Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving [1.9508863993381267]
Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs).
arXiv Detail & Related papers (2025-07-14T11:40:17Z)
- QAMA: Scalable Quantum Annealing Multi-Head Attention Operator for Deep Learning [48.12231190677108]
Quantum Annealing Multi-Head Attention (QAMA) is a novel drop-in operator that reformulates attention as an energy-based Hamiltonian optimization problem. In this framework, token interactions are encoded into binary quadratic terms, and quantum annealing is employed to search for low-energy configurations. Empirically, evaluation on both natural language and vision benchmarks shows that, across tasks, accuracy deviates by at most 2.7 points from standard multi-head attention.
arXiv Detail & Related papers (2025-04-15T11:29:09Z)
- Pushing the Boundary of Quantum Advantage in Hard Combinatorial Optimization with Probabilistic Computers [0.4969640751053581]
We show that probabilistic computers (p-computers) provide a compelling and scalable classical pathway for solving hard optimization problems. We focus on two key algorithms applied to 3D spin glasses: discrete-time simulated quantum annealing (DT-SQA) and adaptive parallel tempering (APT). We show that APT, when supported by non-local isoenergetic cluster moves, exhibits more favorable scaling and ultimately outperforms DT-SQA.
arXiv Detail & Related papers (2025-03-13T12:24:13Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory hungry and incur significant inference latency even on cutting-edge AI accelerators. We propose LeanAttention, a scalable technique for computing self-attention during the token-generation phase.
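The summary does not give LeanAttention's scheduling details. The usual building block for hardware-aware decode-phase attention is splitting the KV cache into chunks and merging partial softmax results with running log-sum-exp statistics; a sketch under that assumption (not LeanAttention's exact algorithm):

```python
import numpy as np

def decode_attention_split(q, K, V, n_chunks=4):
    """Decode-phase attention for a single query vector, computed over the
    KV cache in chunks and merged with running max / sum statistics, so the
    chunks can be processed independently and reduced. This split-and-reduce
    pattern is the standard basis for decode attention kernels; it is shown
    here as an illustration, not as LeanAttention's exact scheme."""
    d = q.shape[0]
    m = -np.inf                    # running max of scores
    s = 0.0                        # running sum of exp(scores - m)
    acc = np.zeros(V.shape[1])     # running weighted sum of values
    for Kc, Vc in zip(np.array_split(K, n_chunks), np.array_split(V, n_chunks)):
        scores = Kc @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)  # rescale old statistics to the new max
        w = np.exp(scores - m_new)
        s = s * scale + w.sum()
        acc = acc * scale + w @ Vc
        m = m_new
    return acc / s
```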
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism. We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device-count times longer than those achievable by prior memory-efficient Transformers.
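The blockwise computation above can be simulated in a single process: each query block accumulates running softmax statistics as the KV blocks visit it in turn, as they would when passed around a device ring. A sketch (the real algorithm distributes the outer loop across devices and overlaps communication with compute):

```python
import numpy as np

def ring_attention_sim(Q, K, V, n_blocks=4):
    """Single-process simulation of ring-style blockwise attention: each
    query block (one per simulated device) sees the KV blocks one at a
    time and updates blockwise softmax statistics, so no block ever
    materializes the full n x n attention matrix."""
    n, d = Q.shape
    out = []
    for q in np.array_split(Q, n_blocks):          # one "device" per block
        m = np.full(q.shape[0], -np.inf)           # per-row running max
        s = np.zeros(q.shape[0])                   # per-row running sum
        acc = np.zeros((q.shape[0], V.shape[1]))   # per-row value sum
        for Kc, Vc in zip(np.array_split(K, n_blocks),
                          np.array_split(V, n_blocks)):  # one ring step
            scores = q @ Kc.T / np.sqrt(d)
            m_new = np.maximum(m, scores.max(axis=1))
            scale = np.exp(m - m_new)
            w = np.exp(scores - m_new[:, None])
            s = s * scale + w.sum(axis=1)
            acc = acc * scale[:, None] + w @ Vc
            m = m_new
        out.append(acc / s[:, None])
    return np.vstack(out)
```

Per-device memory is O(block size × d) instead of O(n²), which is what lets sequence length scale with the number of devices.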
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inform the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE). Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
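The core trick behind the Toeplitz observation is standard: a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a size-2n circulant and using the FFT. A minimal sketch of that primitive (the full kernelized-attention pipeline around it is not shown):

```python
import numpy as np

def toeplitz_matvec_fft(t_neg, t_pos, x):
    """Multiply a Toeplitz matrix T (T[i, j] = t[i - j]) by a vector in
    O(n log n) via circulant embedding and the FFT, avoiding the explicit
    n x n RPE matrix.
    t_pos[k] holds t[k] for k = 0..n-1 (first column of T),
    t_neg[k] holds t[-(k+1)] for k = 0..n-2 (rest of the first row)."""
    n = len(x)
    # First column of the 2n-size circulant embedding of T.
    c = np.concatenate([t_pos, [0.0], t_neg[::-1]])
    # Circular convolution via FFT; the first n entries equal T @ x.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, 2 * n))
    return y[:n].real
```

Applied per attention head, this replaces the O(n²) RPE contribution with FFTs over length-2n sequences.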
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.