Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
- URL: http://arxiv.org/abs/2502.08363v1
- Date: Wed, 12 Feb 2025 12:50:15 GMT
- Title: Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
- Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
- Abstract summary: We introduce Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds.
This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy.
Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search.
- Score: 1.6112718683989882
- Abstract: The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
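The thresholding idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single scalar threshold, and the calibration rule (averaging each calibration row's k-th largest score) are assumptions made for illustration, and the paper's numerical compensation techniques are not reproduced here.

```python
import numpy as np

def calibrate_threshold(score_rows, keep):
    """Hypothetical calibration: average the keep-th largest score of each row.

    Assumption: one scalar threshold is enough; the paper's actual
    calibration procedure may differ.
    """
    kth_largest = [np.sort(row)[-keep] for row in score_rows]
    return float(np.mean(kth_largest))

def thresholded_attention(q, K, V, theta):
    """Attention output for one query, using only scores >= theta."""
    scores = K @ q / np.sqrt(q.shape[0])   # pre-softmax scores, shape (n,)
    keep = scores >= theta                 # elementwise test, no sort over the row
    if not keep.any():                     # fallback (assumption): keep the best key
        keep[np.argmax(scores)] = True
    kept = scores[keep]
    weights = np.exp(kept - kept.max())    # numerically stable softmax
    weights /= weights.sum()               # normalise over the survivors only
    return weights @ V[keep], keep         # only the selected rows of V are read

rng = np.random.default_rng(0)
n, d = 16, 8
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)

# Calibrate on this query's own scores, aiming for 4 surviving keys.
scores = K @ q / np.sqrt(d)
theta = calibrate_threshold([scores], keep=4)
out, mask = thresholded_attention(q, K, V, theta)
print(int(mask.sum()), "of", n, "keys kept")
```

Because each score is compared with the threshold independently, the test can be applied tile by tile, whereas a top-k selection needs the whole score row before it can decide what to keep.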
Related papers
- Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs [10.52833484759311]
We propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism.
It dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget.
We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup.
arXiv Detail & Related papers (2025-02-17T08:39:43Z)
- Continual Low-Rank Scaled Dot-product Attention [67.11704350478475]
We introduce a new formulation of Scaled Dot-product Attention, based on the Nyström approximation, that is suitable for Continual Inference.
In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude.
arXiv Detail & Related papers (2024-12-04T11:05:01Z)
- ReduceFormer: Attention with Tensor Reduction by Summation [4.985969607297595]
We introduce ReduceFormer, a family of models optimized for efficiency with the spirit of attention.
ReduceFormer leverages only simple operations such as reduction and element-wise multiplication, leading to greatly simplified architecture and improved inference performance.
The proposed model family is suitable for edge devices where compute resource and memory bandwidth are limited, as well as for cloud computing where high throughput is sought after.
arXiv Detail & Related papers (2024-06-11T17:28:09Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the dot product pairwise matching attention layer is redundant for the model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
- Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys.
We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
arXiv Detail & Related papers (2021-06-13T02:30:23Z)
- On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers [13.401707395755746]
We study the full range of typical attention values necessary for NLP tasks.
We find nearly 80% of attention values can be pruned to zeros with minimal ($< 1.0\%$) relative loss in accuracy.
We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa.
arXiv Detail & Related papers (2021-06-02T17:45:47Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.