TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
- URL: http://arxiv.org/abs/2410.05076v1
- Date: Mon, 7 Oct 2024 14:30:27 GMT
- Title: TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
- Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia,
- Abstract summary: Large language models (LLMs) have driven significant advancements across diverse NLP tasks.
This paper introduces TidalDecode, a system for fast and accurate LLM decoding through position persistent sparse attention.
- Score: 7.4088392854630625
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
Related papers
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [23.20856449846164]
TokenSelect is a model-agnostic, training-free method for efficient and accurate long-context inference.
A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention and up to 2.28x acceleration in end-to-end latency.
arXiv Detail & Related papers (2024-11-05T07:56:24Z) - FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [11.146015814220858]
FIRST is an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence.
Our approach reveals that input adaptivity is critical - indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on task.
arXiv Detail & Related papers (2024-10-16T12:45:35Z) - ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression [29.163757099307553]
We present ZipVL, an efficient inference framework designed for large vision-language models (LVLMs)
ZipVL resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens.
Experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6$times$ and reduce GPU memory usage by 50.0%.
arXiv Detail & Related papers (2024-10-11T07:24:21Z) - Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly.
Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items.
We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z) - FlexAttention for Efficient High-Resolution Vision-Language Models [67.82024785677801]
We propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models.
A high-resolution image is encoded both as high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized.
Experiments on multimodal benchmarks prove that our FlexAttention outperforms existing high-resolution VLMs.
arXiv Detail & Related papers (2024-07-29T17:59:05Z) - Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - LLMCad: Fast and Scalable On-device Large Language Model Inference [11.103824752113148]
Generative tasks, such as text generation and question answering, hold a crucial position in the realm of mobile applications.
Currently, the execution of these generative tasks heavily depends on Large Language Models (LLMs)
We introduce LLMCad, an on-device inference engine specifically designed for efficient generative Natural Language Processing (NLP) tasks.
arXiv Detail & Related papers (2023-09-08T10:44:19Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.