LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
- URL: http://arxiv.org/abs/2602.04541v1
- Date: Wed, 04 Feb 2026 13:34:12 GMT
- Title: LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
- Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
- Abstract summary: Long-context large language models (LLMs) expose a key bottleneck: the rapidly expanding key-value cache during decoding. We propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism. We demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline.
- Score: 27.856769454125573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
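The retrieval-head / sparse-head split described in the abstract can be illustrated with a small pure-Python sketch of one decode step. This is not the authors' implementation. The function names, the `is_retrieval` flags (given by hand here rather than learned with HardKuma), the rule that a sparse head reuses the token set of the most recent retrieval head, and the plain sort used in place of a hardware-efficient top-k kernel are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values, token_ids):
    """Scaled dot-product attention restricted to the cached tokens in token_ids."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids))
            for d in range(dim)]

def top_k_tokens(query, keys, k):
    """Positions of the k cached keys that score highest for this query."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

def hybrid_decode_step(queries, keys, values, is_retrieval, k):
    """One decode step over all heads (hypothetical layout).

    queries[h] is head h's query vector; keys[h] and values[h] are its KV cache.
    Returns the per-head outputs and the last selected token set.
    """
    outputs, selected = [], None
    for h, query in enumerate(queries):
        if is_retrieval[h] or selected is None:
            # Retrieval head (or no token set yet): attend to the full cache
            # and refresh the shared set of crucial tokens.
            selected = top_k_tokens(query, keys[h], k)
            full = list(range(len(keys[h])))
            outputs.append(attend(query, keys[h], values[h], full))
        else:
            # Sparse head: reuse the token set and touch only k cache entries.
            outputs.append(attend(query, keys[h], values[h], selected))
    return outputs, selected
```

In this sketch the cost of a sparse head scales with `k` rather than with the cache length, which is where a method of this kind would save time; how close the outputs stay to full attention depends on how well the reused token set matches what each sparse head would have selected itself.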
Related papers
- Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference [9.469995152350899]
We propose Kascade, a training-free sparse attention method that leverages known observations. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over a FlashAttention-3 baseline on H100 GPUs.
arXiv Detail & Related papers (2025-12-18T10:37:14Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models. LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z)
- PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$.
arXiv Detail & Related papers (2025-03-05T15:24:11Z)
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
LongSpec is a framework that addresses the challenges of efficient inference over long contexts. LongSpec achieves up to a 3.26x speedup over strong Flash Attention baselines. The code is available at https://github.com/sail-sg/LongSpec.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
- Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks.
We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information.
It can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of the model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z)
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention [7.4088392854630625]
Large language models (LLMs) have driven significant advancements across diverse NLP tasks.
This paper introduces TidalDecode, a system for fast and accurate LLM decoding through position persistent sparse attention.
arXiv Detail & Related papers (2024-10-07T14:30:27Z)
- LongHeads: Multi-Head Attention is Secretly a Long Context Processor [49.1661870007655]
LongHeads is a training-free framework that enhances large language models' long context ability.
Instead of allowing each head to attend to the full sentence, we allow each head to process in-distribution length by selecting and attending to context chunks.
LongHeads achieves 100% accuracy at the 128k length on the passkey retrieval task.
arXiv Detail & Related papers (2024-02-16T13:39:34Z)
- SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.