SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models
- URL: http://arxiv.org/abs/2406.05678v1
- Date: Sun, 9 Jun 2024 07:23:34 GMT
- Title: SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models
- Authors: Hengyu Zhang
- Abstract summary: Self-attention mechanism scales quadratically with sequence length.
LongLoRA proposed shifted sparse attention (S(2)-Attn), effectively enabling context extension.
SinkLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement compared to full attention.
- Score: 4.497551890206997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extending the functionality of the Transformer model to accommodate longer sequence lengths has become a critical challenge. This extension is crucial not only for improving tasks such as language translation and long-context processing but also for enabling novel applications like chatbots, code generation, and multimedia content creation. The primary obstacle is the self-attention mechanism, which scales quadratically with sequence length in terms of computation time and memory requirements. LongLoRA proposed shifted sparse attention (S\(^2\)-Attn), effectively enabling context extension and leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. However, LongLoRA is still not as efficient as vanilla attention, reaching only 39\% of the perplexity improvement of full attention. This inefficiency is due to the cyclic shift applied within different attention head patterns, which causes either chaos in the attention head structure or unnecessary information exchange between token groups. To address these issues, we propose \textbf{SinkLoRA}, which features better work partitioning. Specifically, (1) we developed SF-Attn with a segmentation and reassembly algorithm to proportionally return cyclically shifted groups of attention heads to their un-shifted state, together with global attention over "sink attention tokens", achieving 92\% of the perplexity improvement of full attention after fine-tuning, and (2) we applied the state-of-the-art KV cache compression algorithm H$_2$O to accelerate inference. Furthermore, we conducted supervised fine-tuning of SinkLoRA on a self-collected LongAlpaca-plus dataset. All our code, models, datasets, and demos are available at \url{https://github.com/Dexter-GT-86/SinkLoRA}.
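The interplay of grouped attention and globally visible "sink" tokens is easiest to see as an attention mask. Below is a minimal numpy sketch of that idea: each query attends to its own contiguous token group plus the first few sink tokens, which every query can see. The group size, sink count, and function names are illustrative assumptions, not the paper's exact SF-Attn or its segmentation-and-reassembly step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sink_sparse_attention(q, k, v, group_size=4, n_sinks=2):
    """Toy sink-augmented sparse attention (NOT the paper's exact SF-Attn).

    Each query sees (a) the first `n_sinks` tokens globally ("sink
    attention tokens") and (b) the tokens in its own contiguous group,
    mimicking how local attention groups are combined with global sinks.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)
    mask[:, :n_sinks] = 0.0                   # sinks are visible to every query
    for start in range(0, n, group_size):     # local attention groups
        end = min(start + group_size, n)
        mask[start:end, start:end] = 0.0
    return softmax(scores + mask) @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))
print(sink_sparse_attention(q, k, v).shape)   # (16, 8)
```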
Related papers
- Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases.
This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm.
The resulting attention mechanism outperforms conventional Softmax attention across various inference lengths.
arXiv Detail & Related papers (2025-01-23T07:21:08Z)
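A concrete reading of that Softplus-plus-$l_1$-norm decomposition, as a hedged numpy sketch: replace Softmax with a non-negative transform (softplus here) followed by $l_1$-normalization. The specific transform and scaling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softplus_l1_attention(q, k, v):
    """Toy attention where Softmax is decomposed into a non-linear map
    plus l1-normalization; softplus and the scaling are assumptions."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.logaddexp(0.0, scores)              # softplus: non-negative weights
    w = w / w.sum(axis=-1, keepdims=True)      # l1-norm re-weighting
    return w @ v
```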
- Star Attention: Efficient LLM Inference over Long Sequences [17.401430615714]
We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts.
Star Attention integrates seamlessly with most Transformer-based Large Language Models trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
arXiv Detail & Related papers (2024-11-26T05:10:04Z)
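A rough single-process sketch of Star Attention's phase-1 idea, under stated assumptions: each block of queries attends to its own block plus a shared "anchor" (first) block. The actual method shards these blocks across hosts and adds a second, global phase for query tokens; both are omitted here.

```python
import numpy as np

def blockwise_anchor_attention(q, k, v, block=8):
    """Toy block-sparse attention in the spirit of Star Attention's
    phase 1: each query block attends to the anchor block plus itself."""
    n, d = q.shape
    out = np.zeros_like(v)
    for s in range(0, n, block):
        e = min(s + block, n)
        if s == 0:
            ks, vs = k[:e], v[:e]
        else:                                  # anchor block + local block
            ks = np.concatenate([k[:block], k[s:e]])
            vs = np.concatenate([v[:block], v[s:e]])
        scores = q[s:e] @ ks.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        out[s:e] = (w / w.sum(axis=-1, keepdims=True)) @ vs
    return out
```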
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
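The Squeezed Attention recipe maps naturally to a small sketch: k-means the fixed context's keys offline, score only the centroids at query time, then run exact attention over keys from the best clusters. The cluster count, `top_c`, and all names are illustrative assumptions, not the paper's tuned procedure.

```python
import numpy as np

def squeezed_attention_toy(q, k_fixed, v_fixed, n_clusters=4, top_c=2, iters=10):
    """Toy Squeezed-Attention-style lookup for a single query vector q."""
    n, d = k_fixed.shape
    # offline: plain k-means over the fixed context's keys
    rng = np.random.default_rng(0)
    cent = k_fixed[rng.choice(n, n_clusters, replace=False)]
    for _ in range(iters):
        assign = ((k_fixed[:, None] - cent[None]) ** 2).sum(-1).argmin(axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                cent[c] = k_fixed[assign == c].mean(axis=0)
    # online: compare q to centroids, keep keys from the top clusters only
    keep_clusters = np.argsort(q @ cent.T)[-top_c:]
    idx = np.flatnonzero(np.isin(assign, keep_clusters))
    scores = q @ k_fixed[idx].T / np.sqrt(d)   # exact attention on survivors
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ v_fixed[idx]
```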
- Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high cumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
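A sketch of Recycled Attention's alternation, assuming a toy decode loop and illustrative `top_k`/`stride` values: full-attention steps record the most-attended cache positions, and the steps in between attend only to that recycled set.

```python
import numpy as np

def recycled_decode(q_steps, k_cache, v_cache, top_k=4, stride=4):
    """Toy Recycled-Attention-style decoding: every `stride` steps do full
    attention and remember the top-K attended positions; otherwise attend
    only to the recycled positions. Returns one output vector per step."""
    d = k_cache.shape[1]
    recycled = np.arange(min(top_k, len(k_cache)))    # fallback index set
    outputs = []
    for t, q in enumerate(q_steps):
        if t % stride == 0:                           # full-attention step
            scores = q @ k_cache.T / np.sqrt(d)
            recycled = np.argsort(scores)[-top_k:]    # remember hot tokens
            vals = v_cache
        else:                                         # partial-attention step
            scores = q @ k_cache[recycled].T / np.sqrt(d)
            vals = v_cache[recycled]
        w = np.exp(scores - scores.max())
        outputs.append((w / w.sum()) @ vals)
    return outputs
```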
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
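A minimal sketch of SPARSEK's selection idea, with the paper's differentiable top-k operator deliberately replaced by a hard top-k for clarity; the single linear scoring map `w_score` stands in for the scoring network and is an assumption.

```python
import numpy as np

def sparsek_toy(q, k, v, w_score, k_keep=4):
    """Toy SPARSEK-style attention: score keys, keep a constant number of
    KV pairs, run attention on the survivors (hard, non-differentiable)."""
    d = q.shape[-1]
    key_scores = k @ w_score                  # scoring-network stand-in
    keep = np.argsort(key_scores)[-k_keep:]   # hard top-k selection
    scores = q @ k[keep].T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v[keep]
```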
- Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [19.618556742380086]
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption.
To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention.
arXiv Detail & Related papers (2024-05-27T17:38:13Z)
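The constant-speed property rests on the associativity trick at the heart of linear attention: with a positive feature map phi, computing phi(Q) @ (phi(K).T @ V) avoids the n-by-n score matrix, so cost grows as O(n d^2) rather than O(n^2 d). The sketch below shows only this trick; Lightning Attention's tiled, IO-aware implementation (its actual contribution) is omitted, and the feature map is an assumption.

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: no n x n score matrix is formed."""
    phi = lambda x: np.maximum(x, 0.0) + eps   # simple positive feature map
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                              # (d, d) summary of keys/values
    z = qf @ kf.sum(axis=0)                    # per-query normalizer
    return (qf @ kv) / z[:, None]
```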
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps mitigate the loss of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
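A toy view of LongVQ's codebook step, with the codebook size and training procedure left out as assumptions: each hidden state is snapped to its nearest codebook entry, so arbitrarily long sequences are summarized by indices into a length-fixed codebook.

```python
import numpy as np

def vq_compress(h, codebook):
    """Toy vector-quantization in the spirit of LongVQ: map each hidden
    state to its nearest codebook vector and return codes + indices."""
    d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K)
    idx = d2.argmin(axis=1)                   # nearest code per token
    return codebook[idx], idx
```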
- Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with sequences of 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z)
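This is the mechanism SinkLoRA's "sink attention tokens" build on. A minimal cache sketch of the StreamingLLM recipe, with illustrative sizes: keep the first few tokens' KV entries permanently (the attention sinks) plus a rolling window of recent tokens, and evict everything in between.

```python
import numpy as np

class SinkKVCache:
    """Toy StreamingLLM-style KV cache: permanent sink entries plus a
    rolling window of recent entries; sizes are illustrative."""
    def __init__(self, n_sink=4, window=8):
        self.n_sink, self.window = n_sink, window
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.n_sink + self.window:
            del self.keys[self.n_sink]        # evict oldest non-sink entry
            del self.values[self.n_sink]

    def kv(self):
        return np.stack(self.keys), np.stack(self.values)
```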
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r.
We place an SSM in the bottom layer of SPADE and employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
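To make SPADE's layer layout concrete, here is a hedged toy of a SPADE-like stack: a simple diagonal state-space recurrence in the bottom layer for global mixing, with windowed local attention above it. The recurrence coefficients, window size, and the absence of projections and MLPs are all simplifying assumptions.

```python
import numpy as np

def ssm_layer(x, a=0.9, b=0.5):
    """Toy diagonal SSM: h_t = a*h_{t-1} + b*x_t, standing in for the
    state-space model in the bottom layer."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = a * h + b * xt
        out[t] = h
    return out

def local_attention_layer(x, window=4):
    """Windowed self-attention for the upper layers: each token attends
    only to its `window` most recent neighbors (itself included)."""
    n, d = x.shape
    out = np.empty_like(x)
    for t in range(n):
        s = max(0, t - window + 1)
        scores = x[t] @ x[s:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[t] = (w / w.sum()) @ x[s:t + 1]
    return out

def spade_like_stack(x, n_layers=3):
    """Bottom layer: SSM (global); remaining layers: local attention."""
    x = ssm_layer(x)
    for _ in range(n_layers - 1):
        x = local_attention_layer(x)
    return x
```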