Related papers: ZETA: Leveraging Z-order Curves for Efficient Top-k Attention

ZETA: Leveraging Z-order Curves for Efficient Top-k Attention

URL: http://arxiv.org/abs/2501.14577v2
Date: Thu, 13 Feb 2025 03:43:09 GMT
Title: ZETA: Leveraging Z-order Curves for Efficient Top-k Attention
Authors: Qiuhao Zeng, Jerry Huang, Peng Lu, Gezheng Xu, Boxing Chen, Charles Ling, Boyu Wang,
Abstract summary: We propose ZETA to enable parallel querying of past tokens for entire sequences.<n>ZETA matches the performance of standard attention on the synthetic textscMulti-Query Associative Recall task.
Score: 22.90397380324185
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length $N$, rendering it prohibitively expensive for long sequences. A promising approach is top-$k$ attention, which selects only the $k$ most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to only attend to past tokens, preventing the existing top-$k$ attention method from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging \textbf{Z}-Order Curves for \textbf{E}fficient \textbf{T}op-$k$ \textbf{A}ttention, to enable parallel querying of past tokens for entire sequences. % in both space and time complexity of $\mathcal{O}(N \log N)$. We first theoretically show that the choice of key and query dimensions involves a trade-off between the curse of dimensionality and the preservation of relative distances after projection. In light of this insight, we propose reducing the dimensionality of keys and queries in contrast to values and further leverage $Z$-order curves to map low-dimensional keys and queries into \emph{one}-dimensional space, which permits parallel sorting, thereby largely improving the efficiency for top-$k$ token selection. Experimental results demonstrate that ZETA matches the performance of standard attention on the synthetic \textsc{Multi-Query Associative Recall} task and outperforms attention and its variants on \textsc{Long Range Arena} and \textsc{WikiText-103} language modeling.

Related papers

Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks.<n>LRMs need to generate long chain-of-thought reasoning in order to think before answering.<n>We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
Attention with Trained Embeddings Provably Selects Important Tokens [73.77633297039097]
Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited.<n>Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent.<n>Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
arXiv Detail & Related papers (2025-05-22T21:00:09Z)
Compression Barriers for Autoregressive Transformers [0.8331054243801623]
Key limitation of autoregressive Transformers is the large memory needed to cache all previous key-value embeddings. We show that any algorithm requires $Omega(dcdot ed)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound.
arXiv Detail & Related papers (2025-02-21T21:37:52Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed.<n>During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant.<n>We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
Linear Transformer Topological Masking with Graph Random Features [52.717865653036796]
We show how to parameterise topological masks as a learnable function of a weighted adjacency matrix. Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data.
arXiv Detail & Related papers (2024-10-04T14:24:06Z)
Reorganizing attention-space geometry with expressive attention [2.7195102129095003]
We study expressive attention (EA), which is based on $(mathbfQTmathbfK)2$, the squared dot product. EA can be introduced into any attention-based code without additional compute costs or memory requirements.
arXiv Detail & Related papers (2024-07-26T08:41:58Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.<n>HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.<n>We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse [9.552839922307587]
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment.
arXiv Detail & Related papers (2024-04-03T12:37:34Z)
One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space [11.735802740426294]
Attention computation takes both the time complexity of $O(n2)$ and the space complexity of $O(n2)$ simultaneously. We introduce a new algorithm that only reads one pass of data in a streaming fashion. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens.
arXiv Detail & Related papers (2023-11-24T18:35:00Z)
Tree Cross Attention [59.8891512435847]
Tree Cross Attention (TCA) is a module based on Cross Attention that only retrieves information from a logarithmic $mathcalO(log(N))$ number of tokens for performing inference. We show that TCA performs comparable to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient.
arXiv Detail & Related papers (2023-09-29T16:50:23Z)
Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer. Our algorithm is not only efficient (achieving more than $3times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
Simplifying and Understanding State Space Models with Diagonal Linear RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs. We empirically show that, despite being conceptually much simpler, $mathrmDLR$ is as performant as previously-proposed SSMs. We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
arXiv Detail & Related papers (2022-12-01T18:53:06Z)
Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection [23.39962989492527]
Transformer-based language models such as BERT have achieved the state-of-the-art on various NLP tasks, but are computationally prohibitive. We present Pyramid-BERT where we replace previously useds with a em core-set based token selection method justified by theoretical results. The core-set based token selection technique allows us to avoid expensive pre-training, gives a space-efficient fine tuning, and thus makes it suitable to handle longer sequence lengths.
arXiv Detail & Related papers (2022-03-27T19:52:01Z)
Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
Recurrently Predicting Hypergraphs [30.092688729343678]
A problem arises from the number of possible multi-way relationships, or hyperedges, scaling in $mathcalO(2n)$ for a set of $n$ elements. We propose a recurrent hypergraph neural network that predicts the incidence matrix by iteratively refining an initial guess of the solution.
arXiv Detail & Related papers (2021-06-26T01:12:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.