Fugu-MT paper translation (abstract): LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Paper overview: LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

arxiv url: http://arxiv.org/abs/2602.04541v1
Date: Wed, 04 Feb 2026 13:34:12 GMT
Status: translation complete
Last updated in system: 2026-02-05 19:45:11.540339
Title: LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Title (reference translation): LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
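Abstract summary: Long-context large language models (LLMs) expose a key bottleneck: the key-value cache, which expands rapidly during decoding. We propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism. We demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline.
Reference score (independently computed attention metric): 27.856769454125573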
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
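Abstract (reference translation): The spread of long-context large language models (LLMs) exposes a major bottleneck: the rapid growth of the key-value cache during decoding. Recent approaches try to mitigate this by sharing a single set of important tokens across layers, but such coarse-grained sharing ignores the functional diversity of attention heads and degrades model performance. We therefore propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that uses a hardware-efficient top-k selection strategy. Specifically, the new HardKuma-based mechanism partitions the attention heads into a small subset of retrieval heads, which dynamically identify important tokens, and sparse heads, which reuse those tokens for efficient computation. Through extensive experiments on leading models such as Llama3 and Qwen3, across various benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we show that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline. Importantly, this is achieved with a speedup of up to 2.7x at a 128K context length. By preserving the functional diversity of attention heads, the method overcomes the performance bottlenecks of existing approaches and enables efficient, high-quality long-context LLM inference.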
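
Illustrative sketch (not from the paper): the abstract describes retrieval heads that identify crucial tokens and sparse heads that reuse them. The pure-Python toy below shows one way that division of labor could look for a single decoding step. Everything in it is an assumption made for illustration: the function names (`top_k_indices`, `hybrid_decode_step`), the use of a single hand-picked retrieval head, dot-product scoring, and the retrieval head attending over the full cache. The HardKuma-based head partitioning and the hardware-efficient top-k kernel mentioned in the abstract are not modeled here.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, indices):
    """Softmax attention for one head, restricted to the cached positions in `indices`."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]


def top_k_indices(query, keys, k):
    """Positions of the k cached keys with the highest dot-product score for `query`."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def hybrid_decode_step(queries, keys, values, retrieval_head, k):
    """One decoding step over per-head caches keys[h][t] and values[h][t].

    Assumption: the retrieval head scores the whole cache and attends to all of it;
    every other head is treated as a sparse head and attends only to the k
    positions the retrieval head selected.
    """
    n_tokens = len(keys[retrieval_head])
    selected = top_k_indices(queries[retrieval_head], keys[retrieval_head], k)
    outputs = []
    for h, q in enumerate(queries):
        idx = list(range(n_tokens)) if h == retrieval_head else selected
        outputs.append(attend(q, keys[h], values[h], idx))
    return outputs, selected


# Toy usage with random data: 4 heads, 16 cached tokens, head dimension 8, k = 4.
rng = random.Random(0)
n_heads, n_tokens, dim, k = 4, 16, 8, 4
keys = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)] for _ in range(n_heads)]
values = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)] for _ in range(n_heads)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_heads)]

outputs, selected = hybrid_decode_step(queries, keys, values, retrieval_head=0, k=k)
print("positions reused by sparse heads:", selected)
```

In this toy, each sparse head touches only `k` of the `n_tokens` cached entries per step, which is where a saving would come from at long context lengths. How the real method assigns heads, shares selections across layers, and reaches the reported speedup is described in the paper, not here.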


List of related papers