Efficient Streaming Language Models with Attention Sinks
- URL: http://arxiv.org/abs/2309.17453v4
- Date: Sun, 7 Apr 2024 00:56:53 GMT
- Title: Efficient Streaming Language Models with Attention Sinks
- Authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis,
- Abstract summary: StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
- Score: 72.20260088848987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
Related papers
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [30.722379261991563]
LazyLLM is a method that selectively computes the KV for tokens important for the next token prediction.
We show that LazyLLM accelerates the prefilling stage of the LLama 2 7B model by 2.34x while maintaining accuracy.
arXiv Detail & Related papers (2024-07-19T06:34:45Z) - SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models [4.497551890206997]
Self-attention mechanism scales quadratically with sequence length.
LongLoRA proposed shifted sparse attention (S(2)-Attn), effectively enabling context extension.
SinkLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement compared to full attention.
arXiv Detail & Related papers (2024-06-09T07:23:34Z) - SirLLM: Streaming Infinite Retentive LLM [74.40196814292426]
Large Language Models (LLMs) process inputs of any length and maintain a degree of memory.
Recent efforts have employed streaming inputs to alleviate the pressure of excessively long text inputs.
We introduce Streaming Infinite Retentive LLM (SirLLM), which allows LLMs to maintain longer memory during infinite-length dialogues.
arXiv Detail & Related papers (2024-05-21T06:37:03Z) - Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks.
arXiv Detail & Related papers (2024-05-09T14:38:53Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [3.659659889927316]
ChunkAttention is a prefix-aware self-attention module for large language models.
It can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime.
Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$times$ compared to the start-of-the-art implementation.
arXiv Detail & Related papers (2024-02-23T09:29:19Z) - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) typically train on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
arXiv Detail & Related papers (2023-08-30T16:47:51Z) - ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that bounded-memory control (ABC) can be subsumed into one abstraction, attention with bounded-memory control (ABC)
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart.
Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.