Poly-attention: a general scheme for higher-order self-attention
- URL: http://arxiv.org/abs/2602.02422v1
- Date: Mon, 02 Feb 2026 18:24:53 GMT
- Title: Poly-attention: a general scheme for higher-order self-attention
- Authors: Sayak Chakrabarti, Toniann Pitassi, Josh Alman
- Abstract summary: We define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens. We give new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix.
- Score: 16.719964872886315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention and Strassen attention, which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting trade-offs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time. Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.
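To make the pairwise-versus-polyadic contrast concrete, here is a minimal NumPy sketch comparing standard self-attention with a naive third-order (triadic) variant that scores triples of tokens. The triadic scoring rule below is an illustrative assumption, not the paper's poly-attention construction; its point is only that materializing scores over triples costs O(n^3) rather than the O(n^2) of pairwise attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_attention(Q, K, V):
    # Standard self-attention: O(n^2) scores over token pairs (i, j).
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    return softmax(scores, axis=-1) @ V

def triadic_attention(Q, K1, K2, V):
    # Naive third-order attention: O(n^3) scores over triples (i, j, k).
    # Score(i, j, k) = q_i . k1_j + q_i . k2_k  (an illustrative choice).
    n, d = Q.shape
    s = (np.einsum('id,jd->ij', Q, K1)[:, :, None]
         + np.einsum('id,kd->ik', Q, K2)[:, None, :]) / np.sqrt(d)
    # Normalize jointly over all (j, k) pairs for each query i.
    A = softmax(s.reshape(n, -1), axis=-1).reshape(n, n, n)
    # Each triple contributes the elementwise product of two value rows.
    return np.einsum('ijk,jd,kd->id', A, V, V)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K1, K2, V = (rng.standard_normal((n, d)) for _ in range(4))
print(pairwise_attention(Q, K1, V).shape)      # (8, 4)
print(triadic_attention(Q, K1, K2, V).shape)   # (8, 4)
```

Both outputs have the same shape; the difference is that the triadic variant builds an n x n x n score tensor, which is exactly the cost the paper's algorithms and lower bounds characterize.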
Related papers
- Polynomial Mixing for Efficient Self-supervised Speech Encoders [50.58463928808225]
Polynomial Mixer (PoM) is a drop-in replacement for multi-head self-attention. PoM matches the performance of multi-head self-attention on downstream speech recognition tasks.
arXiv Detail & Related papers (2026-02-28T14:45:55Z)
- How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy [39.666337901651865]
We put forward Random Batch Attention (RBA), a linear self-attention mechanism with theoretical support for maintaining expressivity. RBA has several significant strengths: it can be implemented in parallel along a new dimension, which yields substantial memory savings.
arXiv Detail & Related papers (2025-11-08T15:34:15Z)
- Fast attention mechanisms: a tale of parallelism [52.7657529272906]
We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms.
arXiv Detail & Related papers (2025-09-10T20:59:44Z)
- Modality Agnostic Efficient Long Range Encoder [14.705955027331674]
We address the challenge of long-context processing on a single device using generic implementations. To overcome these limitations, we propose MAELRE, a unified and efficient transformer architecture. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models.
arXiv Detail & Related papers (2025-07-25T16:19:47Z)
- Sequential-Parallel Duality in Prefix Scannable Models [68.39855814099997]
Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba. This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference?
arXiv Detail & Related papers (2025-06-12T17:32:02Z)
- QAMA: Scalable Quantum Annealing Multi-Head Attention Operator for Deep Learning [48.12231190677108]
Quantum Annealing Multi-Head Attention (QAMA) is proposed, a novel drop-in operator that reformulates attention as an energy-based Hamiltonian optimization problem. In this framework, token interactions are encoded into binary quadratic terms, and quantum annealing is employed to search for low-energy configurations. Empirically, evaluation on both natural language and vision benchmarks shows that, across tasks, accuracy deviates by at most 2.7 points from standard multi-head attention.
arXiv Detail & Related papers (2025-04-15T11:29:09Z)
- Fundamental Limitations on Subquadratic Alternatives to Transformers [3.514389461266844]
We focus on document similarity tasks, where one is given many documents as input and would like to find a pair which is approximately the most similar. We prove that the Transformer is able to perform this task, and that this task cannot be performed in truly subquadratic time by any algorithm.
arXiv Detail & Related papers (2024-10-05T19:21:13Z)
- Are queries and keys always relevant? A case study on Transformer wave functions [0.0]
The dot-product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers. We explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions.
arXiv Detail & Related papers (2024-05-29T08:32:37Z)
- Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z)
- On The Computational Complexity of Self-Attention [22.3395465641384]
Modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input.
We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false.
As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear time.
arXiv Detail & Related papers (2022-09-11T14:38:10Z)
- Learning Sequence Representations by Non-local Recurrent Neural Memory [61.65105481899744]
We propose a Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning.
Our model is able to capture long-range dependencies, and latent high-level features can be distilled from it.
Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.
arXiv Detail & Related papers (2022-07-20T07:26:15Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
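The Taylor-series idea from "On The Computational Complexity of Self-Attention" above can be sketched in a few lines. A first-order expansion exp(q.k) ≈ 1 + q.k = phi(q).phi(k) with phi(x) = [1, x] is a minimal illustrative choice, not that paper's construction; the point is that once the kernel factorizes, associativity lets the numerator and denominator be accumulated without ever forming the n x n score matrix.

```python
import numpy as np

def taylor_linear_attention(Q, K, V):
    # First-order Taylor feature map: exp(q.k) ~= 1 + q.k = phi(q).phi(k)
    # with phi(x) = [1, x]. Reassociating (phi(Q) phi(K)^T) V as
    # phi(Q) (phi(K)^T V) gives O(n d^2) time; no n x n matrix is formed.
    phi_q = np.concatenate([np.ones((Q.shape[0], 1)), Q], axis=1)
    phi_k = np.concatenate([np.ones((K.shape[0], 1)), K], axis=1)
    KV = phi_k.T @ V            # (d+1, d_v), accumulated once over all keys
    Z = phi_k.sum(axis=0)       # (d+1,), components of the softmax normalizer
    num = phi_q @ KV            # (n, d_v)
    den = phi_q @ Z             # (n,)
    return num / den[:, None]

def exact_attention(Q, K, V):
    # Reference: exact (unscaled) softmax attention, O(n^2).
    W = np.exp(Q @ K.T)
    return (W / W.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(1)
n, d = 16, 4
Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
approx = taylor_linear_attention(Q, K, V)
exact = exact_attention(Q, K, V)
print(np.max(np.abs(approx - exact)))  # small when q.k stays near zero
```

The approximation is only faithful when the dot products are small (so the first-order Taylor term dominates), which is why the inputs above are scaled down; this mirrors the trade-off in the main abstract between expressivity and the size of coefficients for which almost-linear-time approximation is possible.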
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.