Related papers: MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

URL: http://arxiv.org/abs/2406.14909v2
Date: Fri, 01 Nov 2024 02:26:18 GMT
Title: MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Authors: Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
Abstract summary: We propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts.
Score: 22.038650467915176
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by $3.9\times$ with the same average attention span, boosting retrieval accuracy by $1.5-7.1\times$ over the uniform-attention baseline across Vicuna-{7B,13B}, and Llama3-{8B,70B} models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from $9\%-36\%$ to within $5\%$ across two long-context understanding benchmarks. MoA achieves a $1.2-1.4\times$ GPU memory reduction, boosting decode throughput by $6.6-8.2\times$ and $1.7-1.9\times$ compared to FlashAttention2 and vLLM, with minimal impact on performance. Our code is available at \url{https://github.com/thu-nics/MoA}.
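To make the idea of per-head attention spans that scale with input length concrete, here is a minimal sketch. It assumes a causal sliding-window mask and a linear span rule of the form alpha + beta * N; the abstract only says that MoA searches over "scaling rules relative to input sequence lengths", so the linear form, the function names, and the (alpha, beta) values below are illustrative assumptions, not the authors' implementation.

```python
def head_span(alpha, beta, seq_len):
    """Attention span of one head under an ASSUMED linear rule alpha + beta * seq_len."""
    span = int(alpha + beta * seq_len)
    return max(1, min(span, seq_len))  # clamp the span to [1, seq_len]


def sliding_window_mask(seq_len, span):
    """Causal mask: query i may attend to key j only if i - span < j <= i."""
    return [[(i - span < j <= i) for j in range(seq_len)] for i in range(seq_len)]


# Hypothetical per-head (alpha, beta) settings, chosen only for illustration.
HEAD_RULES = {
    "local_head": (4, 0.0),    # fixed-length local context, independent of input size
    "elastic_head": (2, 0.5),  # span grows as the input gets longer
}


def spans_for(seq_len):
    """Span of every hypothetical head at a given input length."""
    return {name: head_span(a, b, seq_len) for name, (a, b) in HEAD_RULES.items()}


if __name__ == "__main__":
    for n in (8, 16):
        print(n, spans_for(n))
```

With these made-up settings, the local head keeps the same span at both input lengths while the elastic head's span grows, which mirrors the two behaviours the abstract describes.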

Related papers

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing [30.941881811797515]
We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert-choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. We show that MoSA is the only sparse attention method compared that can outperform the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget.
arXiv Detail & Related papers (2025-05-01T05:22:11Z)
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference [21.47425403468577]
We propose SpargeAttn, a universal sparse and quantized attention for any model. Our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics.
arXiv Detail & Related papers (2025-02-25T12:02:17Z)
Scaling Embedding Layers in Language Models [52.47659840377581]
SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
arXiv Detail & Related papers (2025-02-03T18:59:32Z)
HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
This paper introduces a novel approach to accelerate attention computation in Large Language Models (LLMs). We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention. Our method introduces no error for ReLU attention and only provably negligible error for Softmax attention.
arXiv Detail & Related papers (2024-10-14T05:18:02Z)
LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity. Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context. It remains unclear whether sparse attention can maintain the model's quality at the scale of today's large language models. This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
Simple linear attention language models balance the recall-throughput tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores. We propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
arXiv Detail & Related papers (2023-06-24T20:11:14Z)
Faster Causal Attention Over Large Sequences Through Sparse Flash Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns. We increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of $8k$ and $16k$ tokens, respectively.
arXiv Detail & Related papers (2023-06-01T21:33:59Z)
Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr. We augment an SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that several efficient attention approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC). ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z)
Efficient Content-Based Sparse Attention with Routing Transformers [34.83683983648021]
Self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Our work proposes to learn dynamic sparse attention patterns that avoid allocating compute and memory to attend to content unrelated to the query of interest.
arXiv Detail & Related papers (2020-03-12T19:50:14Z)
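The H$_2$O entry in the list above describes a KV cache eviction policy that keeps a balance of recent tokens and "heavy hitter" tokens. A simplified sketch of that selection step follows. It assumes that an accumulated attention score per cached position is tracked elsewhere; the function name, parameters, and example scores are hypothetical and are not taken from the paper's code.

```python
def select_kv_to_keep(acc_scores, num_recent, num_heavy):
    """Return the sorted cache positions to keep under a recent + heavy-hitter budget.

    acc_scores[i] is the attention mass that position i has accumulated so far
    (ASSUMED to be maintained by the caller during decoding).
    """
    n = len(acc_scores)
    recent_start = max(0, n - num_recent)
    recent = list(range(recent_start, n))  # always keep the newest positions
    older = range(recent_start)            # eviction candidates
    # Among the older positions, keep those with the largest accumulated score.
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:num_heavy]
    return sorted(heavy + recent)


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.7, 0.05, 0.2, 0.3]  # made-up accumulated scores
    print(select_kv_to_keep(scores, num_recent=2, num_heavy=2))
```

Positions that are neither recent nor among the top scorers are the ones a policy of this kind would evict; a real implementation would apply this per head and update the scores at every decoding step.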

This list is automatically generated from the titles and abstracts of the papers on this site.