SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- URL: http://arxiv.org/abs/2410.13276v2
- Date: Fri, 18 Oct 2024 05:01:11 GMT
- Title: SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- Authors: Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang,
- Abstract summary: We develop a new attention mechanism that augments conventional attention with a learnable gate, which adaptively selects the significant blocks in an attention map and treats the remaining blocks as sparse.
SeerAttention achieves a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
- Score: 11.334069751834264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity limits the efficiency and scalability of LLMs, especially those with a long-context window. A promising approach to address this limitation is to leverage the sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics to approximate sparsity. This practice falls short of fully capturing the dynamic nature of attention sparsity in language-based tasks. This paper argues that attention sparsity should be learned rather than predefined. To this end, we design SeerAttention, a new attention mechanism that augments conventional attention with a learnable gate that adaptively selects the significant blocks in an attention map and treats the remaining blocks as sparse. Such block-level sparsity effectively balances accuracy and speedup. To enable efficient learning of the gating network, we develop a customized FlashAttention implementation that extracts the block-level ground truth of the attention map with minimal overhead. SeerAttention not only applies to post-training, but also excels in long-context fine-tuning. Our results show that at post-training stages, SeerAttention significantly outperforms state-of-the-art static or heuristic-based sparse attention methods, while also being more versatile and flexible in adapting to varying context lengths and sparsity ratios. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
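To make the gating idea concrete, below is a minimal sketch of a block-level attention gate in the spirit of the abstract: queries and keys are pooled into block representatives, small learned projections score query-block/key-block pairs, and only the top-scoring key blocks are kept for a block-sparse kernel. The module name, pooling choice, and top-k selection rule are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of a learnable block-level attention gate in the spirit of
# SeerAttention. Shapes, module names, and the pooling/gating choices are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn


class BlockAttnGate(nn.Module):
    def __init__(self, head_dim: int, block_size: int = 64):
        super().__init__()
        self.block_size = block_size
        # Learnable projections that map pooled Q/K blocks to gate features.
        self.q_proj = nn.Linear(head_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, q, k, keep_ratio: float = 0.1):
        # q, k: [batch, heads, seq_len, head_dim]; seq_len is assumed to be
        # divisible by block_size for simplicity.
        B, H, S, D = q.shape
        nb = S // self.block_size
        # Pool each block of queries/keys into one representative vector.
        q_blk = q.view(B, H, nb, self.block_size, D).mean(dim=3)
        k_blk = k.view(B, H, nb, self.block_size, D).mean(dim=3)
        # Block-level estimate of the attention map: [B, H, nb, nb].
        scores = torch.einsum("bhid,bhjd->bhij",
                              self.q_proj(q_blk), self.k_proj(k_blk)) / D ** 0.5
        # Keep the top-k key blocks per query block; the rest are deemed sparse.
        k_keep = max(1, int(keep_ratio * nb))
        topk = scores.topk(k_keep, dim=-1).indices
        block_mask = torch.zeros_like(scores, dtype=torch.bool)
        block_mask.scatter_(-1, topk, True)
        return block_mask  # pass to a block-sparse attention kernel
```

During post-training, a gate of this kind could be supervised against block-level targets pooled from the full attention map, which is what the customized FlashAttention kernel described in the abstract extracts.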
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - MoBA: Mixture of Block Attention for Long-Context LLMs [46.10222520755179]
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI).
Existing approaches either impose strongly biased structures, such as task-specific sink or window attention, or radically modify the attention mechanism into linear approximations.
We propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously.
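A rough sketch of per-query block selection in this spirit is shown below; the block size, top-k value, and mean-pooled scoring rule are simplifying assumptions, not the paper's exact design.

```python
# Rough sketch of per-query block selection: each query scores mean-pooled
# key blocks and attends only within its top-k blocks. The block size, top-k,
# and causal handling are simplifying assumptions.
import torch


def select_blocks(q, k, block_size: int = 128, top_k: int = 3):
    # q, k: [batch, heads, seq_len, head_dim]
    B, H, S, D = q.shape
    nb = S // block_size
    k_blk = k.view(B, H, nb, block_size, D).mean(dim=3)        # [B, H, nb, D]
    gate = torch.einsum("bhsd,bhnd->bhsn", q, k_blk)           # query-to-block scores
    idx = gate.topk(top_k, dim=-1).indices                     # [B, H, S, top_k]
    mask = torch.zeros_like(gate, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask  # per-query block mask for a block-sparse attention kernel
```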
arXiv Detail & Related papers (2025-02-18T14:06:05Z) - AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.
AttentionPredictor accurately predicts the attention score while consuming negligible memory.
We also propose a cross-token critical cache prefetching framework that hides the token-estimation time overhead to accelerate the decoding stage.
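The sketch below illustrates the general idea of learning-based critical-token identification: a tiny predictor consumes the recent history of per-token attention scores and estimates which KV-cache entries will matter for upcoming steps. The architecture and input features are assumptions for illustration, not the paper's exact design.

```python
# Tiny predictor over the temporal history of attention scores; estimates
# which KV-cache entries will be critical for upcoming decoding steps.
# Architecture and features are assumptions for illustration only.
import torch
import torch.nn as nn


class CriticalTokenPredictor(nn.Module):
    def __init__(self, history: int = 8):
        super().__init__()
        # Pointwise convs over the cache dimension, mixing the score history.
        self.net = nn.Sequential(
            nn.Conv1d(history, 16, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=1),
        )

    def forward(self, score_history: torch.Tensor, keep: int):
        # score_history: [batch, history, cache_len] attention scores from the
        # last `history` decoding steps over the KV cache.
        pred = self.net(score_history).squeeze(1)       # [batch, cache_len]
        return pred.topk(keep, dim=-1).indices          # indices of critical tokens
```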
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training [68.7896349660824]
We present an in-depth analysis of the progressive overfitting problem from the lens of Seq FT.
Considering that overly fast representation learning and a biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Classifier Alignment (SLCA++) framework.
Our approach involves a Slow Learner that selectively reduces the learning rate of backbone parameters, and a Classifier Alignment step that aligns the disjoint classification layers in a post-hoc fashion.
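A minimal sketch of the slow-learner half of this idea, assuming a simple two-group optimizer with a reduced backbone learning rate (the 100x ratio and optimizer choice are illustrative):

```python
# Sketch of the slow-learner idea: the pretrained backbone gets a much smaller
# learning rate than the task head during sequential fine-tuning. The 100x
# ratio and the optimizer choice are illustrative assumptions.
import torch


def build_optimizer(backbone, head, base_lr: float = 1e-3):
    return torch.optim.SGD(
        [
            {"params": backbone.parameters(), "lr": base_lr * 0.01},  # slow backbone
            {"params": head.parameters(), "lr": base_lr},             # fast classifier
        ],
        momentum=0.9,
    )
```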
arXiv Detail & Related papers (2024-08-15T17:50:07Z) - S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context.
It remains unclear whether sparse attention can maintain model quality at the scale of today's large language models.
This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimizations for sparse attention, customizable at both per-head and per-context-range levels.
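The sketch below shows one possible per-head sharding scheme: each head attends to a strided shard of the context plus a local window, so the heads jointly cover the full context. The specific sharding rule and window size are assumptions for illustration, not the library's actual policy.

```python
# One possible per-head sharding scheme: each head sees a strided shard of
# past tokens plus a local window, so the union of heads covers the context.
# The sharding rule and window size are illustrative assumptions.
import torch


def per_head_shard_mask(seq_len: int, num_heads: int, local_window: int = 128):
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]                       # [S, S]
    masks = []
    for h in range(num_heads):
        shard = (pos[None, :] % num_heads) == h                 # strided KV shard
        local = (pos[:, None] - pos[None, :]).abs() < local_window
        masks.append(causal & (shard | local))
    return torch.stack(masks)                                   # [H, S, S] boolean
```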
arXiv Detail & Related papers (2024-07-25T00:27:07Z) - ReAttention: Training-Free Infinite Context with Finite Attention Scope [65.91272939057592]
The long-context capability of Large Language Models (LLMs) has seen significant breakthroughs, but the maximum supported context length remains a critical bottleneck limiting their practical applications.
We propose ReAttention, a training-free approach that enables LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope, given sufficient memory resources.
We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods.
arXiv Detail & Related papers (2024-07-21T14:23:37Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Million-tokens Inference) is a sparse computation method designed to accelerate the pre-filling stage of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
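A hedged sketch of dynamic sparsity estimation for pre-filling follows: block importance is approximated from a small sample of queries, and full attention is computed only for the highest-scoring blocks. The sampling and selection rules are illustrative assumptions, not MInference's exact patterns.

```python
# Cheap block-level sparsity estimation for pre-filling: sample a few queries
# per block, score them against pooled key blocks, and keep only the blocks
# with the largest estimated attention mass. Sampling and selection rules are
# illustrative assumptions.
import torch


def estimate_block_pattern(q, k, block_size: int = 64, sample: int = 16,
                           keep_ratio: float = 0.1):
    # q, k: [heads, seq_len, head_dim]; assumes block_size >= sample and
    # seq_len divisible by block_size.
    H, S, D = q.shape
    nb = S // block_size
    q_s = q.view(H, nb, block_size, D)[:, :, -sample:, :].mean(dim=2)  # [H, nb, D]
    k_b = k.view(H, nb, block_size, D).mean(dim=2)                     # [H, nb, D]
    est = torch.einsum("hid,hjd->hij", q_s, k_b) / D ** 0.5            # [H, nb, nb]
    keep = max(1, int(keep_ratio * nb))
    idx = est.topk(keep, dim=-1).indices
    mask = torch.zeros_like(est, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask  # block-level pattern for a sparse prefill kernel
```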
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [67.97667465509504]
We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns.
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
arXiv Detail & Related papers (2024-06-24T13:41:08Z) - Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs).
We propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
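The sketch below illustrates one plausible form of training-free attention calibration: columns that absorb a disproportionate share of attention are damped and the weights renormalized. The sink-detection threshold and scaling factor are assumptions, not ACT's exact procedure.

```python
# Plausible form of training-free attention calibration: damp columns that
# absorb a disproportionate share of attention and renormalize. Threshold and
# scaling factor are assumptions, not the paper's exact procedure.
import torch


def calibrate_attention(attn, sink_thresh: float = 0.3, scale: float = 0.5):
    # attn: [heads, q_len, k_len] post-softmax attention weights.
    sink = attn.mean(dim=1, keepdim=True) > sink_thresh    # columns that dominate
    damped = torch.where(sink, attn * scale, attn)         # scale down sink columns
    return damped / damped.sum(dim=-1, keepdim=True)       # renormalize per query
```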
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
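A minimal sketch of a short-long convolution mixing layer in this spirit, assuming causal depthwise convolutions with illustrative kernel sizes:

```python
# Causal short + long depthwise convolutions as a token-mixing layer; kernel
# sizes and the absence of gating are illustrative assumptions.
import torch
import torch.nn as nn


class ShortLongConv(nn.Module):
    def __init__(self, d_model: int, short_k: int = 3, long_k: int = 128):
        super().__init__()
        self.short = nn.Conv1d(d_model, d_model, short_k,
                               padding=short_k - 1, groups=d_model)
        self.long = nn.Conv1d(d_model, d_model, long_k,
                              padding=long_k - 1, groups=d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]; trimming the right padding keeps the
        # convolutions causal.
        S = x.shape[1]
        h = x.transpose(1, 2)
        y = self.short(h)[..., :S] + self.long(h)[..., :S]
        return y.transpose(1, 2)
```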
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [19.618556742380086]
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption.
To enhance accuracy while preserving efficiency, we introduce TransNormerLLM (TNL), a new architecture tailored to our Lightning Attention.
arXiv Detail & Related papers (2024-05-27T17:38:13Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMA, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
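A hedged sketch of input-adaptive FFN skipping follows: if a layer's FFN has recently acted as a near-identity map for the current input (high cosine similarity between its input and output), it is skipped for the next token. The similarity criterion and threshold are assumptions for illustration.

```python
# Input-adaptive FFN skipping: if the FFN has recently been a near-identity
# map for this input, skip it for the current token. Criterion and threshold
# are illustrative assumptions.
import torch.nn.functional as F


def maybe_skip_ffn(hidden, ffn, last_sim, threshold: float = 0.98):
    # hidden: [batch, d_model] hidden state of the current token at one layer.
    if last_sim is not None and last_sim > threshold:
        return hidden, last_sim                  # skip: FFN was ~identity recently
    out = hidden + ffn(hidden)                   # residual FFN as usual
    sim = F.cosine_similarity(hidden, out, dim=-1).mean().item()
    return out, sim
```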
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models [20.78813311569383]
We present Lightning Attention-2, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits.
Specifically, we utilize the conventional attention mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
Various experiments are conducted on different model sizes and sequence lengths.
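The intra/inter-block split can be sketched as follows: within a tile, ordinary masked attention is used; across tiles, a running K^T V state carries the contribution of all earlier tiles. Normalization and feature maps are omitted, and the tile size is an assumption.

```python
# Intra/inter-block split for causal linear attention: ordinary masked
# attention inside a tile, a running K^T V state for earlier tiles.
# Normalization and feature maps are omitted; tile size is an assumption.
import torch


def tiled_linear_attention(q, k, v, tile: int = 128):
    # q, k, v: [seq_len, dim] for one head; assumes a non-negative feature map
    # (e.g., ELU+1) has already been applied to q and k.
    S, D = q.shape
    out = torch.zeros_like(v)
    kv_state = torch.zeros(D, v.shape[-1], dtype=q.dtype, device=q.device)
    for s in range(0, S, tile):
        qs, ks, vs = q[s:s + tile], k[s:s + tile], v[s:s + tile]
        inter = qs @ kv_state                 # contribution of all previous tiles
        intra = (qs @ ks.T).tril() @ vs       # conventional causal attention in-tile
        out[s:s + tile] = inter + intra
        kv_state = kv_state + ks.T @ vs       # fold this tile into the running state
    return out
```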
arXiv Detail & Related papers (2024-01-09T16:27:28Z) - Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z) - Faster Causal Attention Over Large Sequences Through Sparse Flash Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of $8k$ and $16k$ tokens, respectively.
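A reference-style sketch of block skipping inside a FlashAttention-like online-softmax loop is given below; it is plain PyTorch for clarity rather than a fused Triton/CUDA kernel, and the tiling and block-mask interface are illustrative assumptions.

```python
# Block skipping inside a FlashAttention-style online-softmax loop, written in
# plain PyTorch for clarity (a real kernel would fuse this in Triton/CUDA).
# The tiling and the block_mask interface are illustrative assumptions.
import torch


def block_sparse_flash(q, k, v, block_mask, tile: int = 64):
    # q: [Sq, D]; k, v: [Sk, D]; block_mask: [Sq // tile, Sk // tile] booleans.
    Sq, D = q.shape
    out = torch.zeros(Sq, v.shape[-1], dtype=q.dtype, device=q.device)
    for i in range(0, Sq, tile):
        qi = q[i:i + tile] / D ** 0.5
        m = torch.full((qi.shape[0], 1), float("-inf"), device=q.device)
        l = torch.zeros(qi.shape[0], 1, device=q.device)
        acc = torch.zeros(qi.shape[0], v.shape[-1], device=q.device)
        for j in range(0, k.shape[0], tile):
            if not block_mask[i // tile, j // tile]:
                continue                      # skip tiles the pattern marks sparse
            s = qi @ k[j:j + tile].T          # [tile_q, tile_k] scores
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            corr = torch.exp(m - m_new)       # rescale previous partial results
            l = l * corr + p.sum(dim=-1, keepdim=True)
            acc = acc * corr + p @ v[j:j + tile]
            m = m_new
        out[i:i + tile] = acc / l.clamp(min=1e-20)
    return out
```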
arXiv Detail & Related papers (2023-06-01T21:33:59Z) - Towards Using Count-level Weak Supervision for Crowd Counting [55.58468947486247]
This paper studies the problem of weakly-supervised crowd counting, which learns a model from only a small amount of location-level annotations (fully supervised) together with a large amount of count-level annotations (weakly supervised).
We devise a simple-yet-effective training strategy, namely Multiple Auxiliary Tasks Training (MATT), to construct regularizers for restricting the freedom of the generated density maps.
arXiv Detail & Related papers (2020-02-29T02:58:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.