Fugu-MT paper translation (abstract): Attention Is All You Need for KV Cache in Diffusion LLMs

Paper abstract: Attention Is All You Need for KV Cache in Diffusion LLMs

arxiv url: http://arxiv.org/abs/2510.14973v1
Date: Thu, 16 Oct 2025 17:59:48 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:15.006947
Title: Attention Is All You Need for KV Cache in Diffusion LLMs
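Title (reference translation): Attention Is All You Need for KV Cache in Diffusion LLMs
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
Abstract summary: Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models. The method achieves higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches.
Reference score (independently computed notability score): 36.94369617373333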
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
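Abstract (reference translation): This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) so as to maximize prediction accuracy while minimizing decoding latency. Decoders in prior methods recompute QKV for all tokens at every denoising step and layer, even though KV states change little across most steps, especially in shallow layers, which results in substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens act mainly as a length bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, which suggests that a selective refresh starting from deeper layers is sufficient; and (3) the most-attended token shows the smallest KV drift, giving a conservative lower bound on the cache change of the other tokens. Building on these observations, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (through an attention-aware drift test on the most-attended token) and ${where}$ to refresh (through a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks show consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. The method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.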
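
The "when" and "where" decisions described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the function names, the use of cosine similarity as the drift measure, and the threshold of 0.9 are all assumptions made for illustration, since the abstract does not specify them. For simplicity the toy also takes freshly computed states for every layer as input, which a real decoder would avoid computing.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_attended_token(attention):
    """Index of the key token that receives the largest total attention.

    `attention` is a list of rows, one per query, each row holding the
    attention weights over the key tokens.
    """
    num_keys = len(attention[0])
    totals = [sum(row[j] for row in attention) for j in range(num_keys)]
    return max(range(num_keys), key=lambda j: totals[j])


def plan_refresh(cached, fresh, attention, threshold=0.9):
    """Return the list of layers to recompute (hypothetical helper).

    cached[l][t] and fresh[l][t] are the cached and current state vectors
    of token t at layer l; attention[l] is the attention matrix at layer l.
    "When": a refresh is triggered if the most-attended token at some layer
    has drifted, i.e. its similarity falls below `threshold` (assumed test).
    "Where": recompute from the shallowest drifted layer onward and reuse
    the caches of all shallower layers.
    """
    num_layers = len(cached)
    for layer in range(num_layers):
        token = most_attended_token(attention[layer])
        similarity = cosine_similarity(cached[layer][token], fresh[layer][token])
        if similarity < threshold:
            return list(range(layer, num_layers))
    return []  # no drift detected: keep every cached layer


if __name__ == "__main__":
    attn = [[0.9, 0.1], [0.8, 0.2]]  # token 0 receives the most attention
    cached = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
    fresh = [
        [[1.0, 0.0], [0.0, 1.0]],  # layer 0: unchanged
        [[1.0, 0.1], [0.0, 1.0]],  # layer 1: small drift, above threshold
        [[0.0, 1.0], [0.0, 1.0]],  # layer 2: large drift on token 0
    ]
    print(plan_refresh(cached, fresh, [attn, attn, attn]))
```

In this toy input only the deepest layer drifts past the threshold, so only that layer is scheduled for recomputation and the two shallower caches are reused, mirroring the abstract's observation that KV dynamics increase with depth.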

Related papers