SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- URL: http://arxiv.org/abs/2410.13276v2
- Date: Fri, 18 Oct 2024 05:01:11 GMT
- Title: SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- Authors: Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang,
- Abstract summary: We develop a new Attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the rest blocks sparse.
SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
- Score: 11.334069751834264
- License:
- Abstract: Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity limits the efficiency and scalability of LLMs, especially for those with a long-context window. A promising approach addressing this limitation is to leverage the sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics to approximate sparsity. This practice falls short to fully capture the dynamic nature of attention sparsity in language-based tasks. This paper argues that attention sparsity should be learned rather than predefined. To this end, we design SeerAttention, a new Attention mechanism that augments the conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the rest blocks sparse. Such block-level sparsity effectively balances accuracy and speedup. To enable efficient learning of the gating network, we develop a customized FlashAttention implementation that extracts the block-level ground truth of attention map with minimum overhead. SeerAttention not only applies to post-training, but also excels in long-context fine-tuning. Our results show that at post-training stages, SeerAttention significantly outperforms state-of-the-art static or heuristic-based sparse attention methods, while also being more versatile and flexible to adapt to varying context lengths and sparsity ratios. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - MoBA: Mixture of Block Attention for Long-Context LLMs [46.10222520755179]
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI)
Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations.
We propose a solution that adheres to the less structure'' principle, allowing the model to determine where to attend autonomously.
arXiv Detail & Related papers (2025-02-18T14:06:05Z) - AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.
AttentionPredictor accurately predicts the attention score while consuming negligible memory.
We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training [68.7896349660824]
We present an in-depth analysis of the progressive overfitting problem from the lens of Seq FT.
Considering that the overly fast representation learning and the biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Alignment (S++) framework.
Our approach involves a Slow Learner to selectively reduce the learning rate of backbone parameters, and a Alignment to align the disjoint classification layers in a post-hoc fashion.
arXiv Detail & Related papers (2024-08-15T17:50:07Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Milliontokens Inference) is a sparse calculation method designed to accelerate pre-filling of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [67.97667465509504]
We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns.
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
arXiv Detail & Related papers (2024-06-24T13:41:08Z) - Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs)
We propose a training-free Attention Technique (ACT) that automatically optimize the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [19.618556742380086]
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption.
To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention.
arXiv Detail & Related papers (2024-05-27T17:38:13Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence
Lengths in Large Language Models [20.78813311569383]
We present Lightning Attention, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits.
Specifically, we utilize the conventional attention mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
Various experiments are conducted on different model sizes and sequence lengths.
arXiv Detail & Related papers (2024-01-09T16:27:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.