Fugu-MT Paper Translation (Abstract): ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

Paper abstract: ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

arxiv url: http://arxiv.org/abs/2603.10088v1
Date: Tue, 10 Mar 2026 14:31:19 GMT
Status: Translation complete
System update date: 2026-03-12 16:22:32.627012
Title: ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping
Authors: Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma
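Abstract summary: Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs). We analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. We propose ES-dLLM, a training-free inference acceleration framework for dLLMs.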
Reference score (independently calculated attention level): 26.560813832545563
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and their potential for parallel generation. Despite these advantages, dLLM inference remains computationally expensive because the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens in early layers based on their estimated importance. Token importance is computed from intermediate tensor variation and the confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6× to 16.8× speedup over the vanilla implementation and up to 1.85× over the state-of-the-art caching method, while preserving generation quality.
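The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: score each token from how much its hidden state changed between two iterations and from its previous confidence, then keep only the highest-scoring tokens for further computation. The function names, the L2 distance, the `alpha` weighting, the "low confidence means more important" rule, and the `keep_ratio` parameter are all illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def l2_variation(prev: Sequence[float], curr: Sequence[float]) -> float:
    """L2 distance between one token's hidden state in two successive iterations."""
    return sum((c - p) ** 2 for p, c in zip(prev, curr)) ** 0.5


def token_importance(
    prev_hidden: Sequence[Sequence[float]],
    curr_hidden: Sequence[Sequence[float]],
    confidence: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Combine tensor variation and previous confidence into one score per token.

    ASSUMPTION: the linear mix and the (1 - confidence) term are hypothetical;
    the abstract only says both signals are used.
    """
    variation = [l2_variation(p, c) for p, c in zip(prev_hidden, curr_hidden)]
    max_var = max(variation) or 1.0  # avoid division by zero when nothing changed
    return [
        alpha * (v / max_var) + (1.0 - alpha) * (1.0 - conf)
        for v, conf in zip(variation, confidence)
    ]


def select_tokens_to_compute(importance: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return the positions of the top-scoring tokens; the rest would be skipped."""
    k = max(1, int(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: 4 tokens with 2-dimensional hidden states (made-up numbers).
prev_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr_states = [[0.0, 0.0], [1.0, 1.1], [4.0, 2.0], [3.0, 3.0]]
prev_confidence = [0.9, 0.2, 0.5, 0.99]

scores = token_importance(prev_states, curr_states, prev_confidence)
kept = select_tokens_to_compute(scores, keep_ratio=0.5)
print(kept)
```

In this toy run, token 2 changed the most and token 1 had the lowest confidence, so those two positions are kept and the stable, high-confidence tokens 0 and 3 are the ones a skipping scheme of this kind would leave out.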

Related papers