Related papers: LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

URL: http://arxiv.org/abs/2602.04541v1
Date: Wed, 04 Feb 2026 13:34:12 GMT
Title: LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
Abstract summary: Long-context large language models (LLMs) expose a key bottleneck: the rapidly expanding key-value cache during decoding. We propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism. We demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline.
Score: 27.856769454125573
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
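Illustrative sketch (not from the paper): the abstract describes retrieval heads that pick crucial tokens and sparse heads that reuse them. The plain-Python toy below shows one way such a decode step could look. Everything in it is an assumption made for illustration: the function names (`top_k_tokens`, `attend`, `hybrid_head_decode_step`), the choice that retrieval heads run full attention, the rule that a sparse head reuses the token set of the most recent retrieval head, and the hand-set `is_retrieval` flags, which stand in for the HardKuma-based partition that the abstract mentions but does not specify.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, token_ids):
    """Scaled dot-product attention restricted to the cached positions in token_ids."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[t])) for t in token_ids]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) for d in range(dim)]


def top_k_tokens(query, keys, k):
    """Positions of the k cached tokens with the highest dot-product score."""
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def hybrid_head_decode_step(queries, keys, values, is_retrieval, k):
    """One decode step over a sequence of heads (hypothetical layout).

    queries[h] is head h's query vector; keys[h] and values[h] are its KV cache.
    is_retrieval[h] marks retrieval heads (assumed given, not learned here).
    """
    shared = None  # token set chosen by the most recent retrieval head
    outputs = []
    for h, query in enumerate(queries):
        if is_retrieval[h] or shared is None:
            # Retrieval head (or no token set yet): attend to the full cache
            # and refresh the shared set of crucial tokens.
            shared = top_k_tokens(query, keys[h], k)
            all_tokens = list(range(len(keys[h])))
            outputs.append(attend(query, keys[h], values[h], all_tokens))
        else:
            # Sparse head: attend only to the k reused positions.
            outputs.append(attend(query, keys[h], values[h], shared))
    return outputs
```

In this toy, a sparse head touches only k cached entries instead of the whole cache, which is where the savings would come from when k is much smaller than the context length.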

Related papers

Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference [9.469995152350899]
We propose Kascade, a training-free sparse attention method that leverages known observations. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs.
arXiv Detail & Related papers (2025-12-18T10:37:14Z)
Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models. LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z)
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$.
arXiv Detail & Related papers (2025-03-05T15:24:11Z)
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
LongSpec is a framework that addresses the challenges of efficient inference over long contexts. LongSpec achieves up to a 3.26x speedup over strong Flash Attention baselines. The code is available at https://github.com/sail-sg/LongSpec.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention, which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information. It can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of the model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z)
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention [7.4088392854630625]
Large language models (LLMs) have driven significant advancements across diverse NLP tasks. This paper introduces TidalDecode, a system for fast and accurate LLM decoding through position persistent sparse attention.
arXiv Detail & Related papers (2024-10-07T14:30:27Z)
LongHeads: Multi-Head Attention is Secretly a Long Context Processor [49.1661870007655]
LongHeads is a training-free framework that enhances large language models' long context ability. Instead of allowing each head to attend to the full sentence, we allow each head to process in-distribution length by selecting and attending to context chunks. LongHeads achieves 100% accuracy at the 128k length on the passkey retrieval task.
arXiv Detail & Related papers (2024-02-16T13:39:34Z)
SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation. In this work, we focus on developing an efficient compression technique for the KV cache. We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.