A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention
- URL: http://arxiv.org/abs/2602.01763v1
- Date: Mon, 02 Feb 2026 07:47:21 GMT
- Title: A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention
- Authors: Xiaowei Ye, Xiaoyu He, Chao Liao, Chen Wu, Pinyan Lu
- Abstract summary: Transformers serve as the foundation of most modern large language models. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.
- Score: 13.144793724034761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba, DeltaNet, etc. Specifically, we establish an expressiveness hierarchy: for sequential function composition, a multi-step reasoning task that must occur within a model's forward pass, an ($L+1$)-layer full attention network is sufficient, whereas any hybrid network interleaving $L-1$ layers of full attention with a substantially larger number ($2^{3L^2}$) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.
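The separation is stated in terms of a sequential function composition task. A minimal sketch of such a task instance is given below; the lookup-table construction, domain size, and helper name `make_instance` are illustrative assumptions, not the paper's exact formalization.

```python
# Minimal sketch of a sequential function composition task instance.
# Hypothetical construction for illustration; the paper's exact task
# formalization (alphabet, encoding, prompt format) may differ.
import random

def make_instance(L: int, domain_size: int = 16, seed: int = 0):
    """Sample L functions f_1, ..., f_L on a finite domain and an input x.

    The target is the composed value f_L(...f_2(f_1(x))...), which a model
    would have to produce within a single forward pass.
    """
    rng = random.Random(seed)
    domain = list(range(domain_size))
    functions = [
        {a: rng.choice(domain) for a in domain}  # each f_i as a lookup table
        for _ in range(L)
    ]
    x = rng.choice(domain)

    target = x
    for f in functions:          # apply f_1 first, then f_2, ..., f_L
        target = f[target]
    return functions, x, target

if __name__ == "__main__":
    fs, x, y = make_instance(L=3)
    print(f"input={x}, composed output={y}")
```

Under the paper's hierarchy, an ($L+1$)-layer full attention network can solve instances of this form, while the stated hybrid networks cannot.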
Related papers
- TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce attentionLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction connection. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction [3.9660062354591754]
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. We introduce a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
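A schematic sketch of such a greedy replacement loop is shown below; the helpers `replace_with_linear` and `evaluate`, and the tolerance criterion, are hypothetical placeholders rather than the authors' actual procedure.

```python
# Schematic sketch of a greedy full-to-linear layer replacement loop.
# `replace_with_linear`, `evaluate`, and `tol` are hypothetical placeholders;
# the paper's actual acceptance criterion may differ.
def greedy_hybridize(model, val_data, evaluate, replace_with_linear, tol=0.005):
    baseline = evaluate(model, val_data)
    for idx in range(len(model.layers)):
        original_layer = model.layers[idx]
        model.layers[idx] = replace_with_linear(original_layer)  # try a swap
        score = evaluate(model, val_data)
        if baseline - score > tol:               # replacement hurts too much:
            model.layers[idx] = original_layer   # revert, keep full attention
    return model
```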
arXiv Detail & Related papers (2026-01-16T02:01:40Z) - Log-Linear Attention [81.09631871212211]
This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants.
arXiv Detail & Related papers (2025-06-05T08:44:51Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors. Second, we confirm that effective local modeling is essential for the success of softmax attention, an area in which linear attention falls short.
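The non-injectivity claim can be illustrated numerically; the sketch below assumes a ReLU feature map, which may differ from the kernel analyzed in the paper, but the effect is the same whenever the feature map collides on distinct queries.

```python
# Minimal numeric illustration of linear attention's non-injectivity,
# assuming a ReLU feature map phi (an illustrative choice, not necessarily
# the one analyzed in the paper).
import numpy as np

def linear_attention_weights(q, K, phi=lambda z: np.maximum(z, 0.0)):
    """Normalized linear-attention weights of query q over the key rows of K."""
    scores = phi(K) @ phi(q)            # phi(q) . phi(k_i) for each key
    return scores / scores.sum()

K = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]])
q1 = np.array([1.0, -1.0])
q2 = np.array([1.0, -2.0])              # distinct query, same phi(q) = [1, 0]

w1 = linear_attention_weights(q1, K)
w2 = linear_attention_weights(q2, K)
print(w1, w2, np.allclose(w1, w2))      # identical weights for distinct queries
```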
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z) - Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z) - Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference [68.12511526813991]
We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
arXiv Detail & Related papers (2020-09-20T06:32:23Z)