Fugu-MT paper translation (abstract): DFlash: Block Diffusion for Flash Speculative Decoding

Paper overview: DFlash: Block Diffusion for Flash Speculative Decoding

arxiv url: http://arxiv.org/abs/2602.06036v1
Date: Thu, 05 Feb 2026 18:59:30 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:09.154684
Title: DFlash: Block Diffusion for Flash Speculative Decoding
Authors: Jian Chen, Yesheng Liang, Zhijian Liu
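Abstract summary: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding. This paper introduces DFlash, a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting.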
Reference score (independently computed attention metric): 11.98141750480807
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
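The draft-then-verify loop described in the abstract can be illustrated with a minimal greedy sketch. This is not the DFlash implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the target LLM and the block drafter, tokens are plain integers, and verification is written as a loop where a real system would score every drafted position in one parallel target forward pass. The toy drafter also builds its block with a loop, whereas a block diffusion drafter would emit the block in a single forward pass. The only property the sketch demonstrates is losslessness: the output matches plain greedy decoding with the target.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token from the target
DraftFn = Callable[[List[Token], int], List[Token]]  # proposes a block of k tokens


def greedy_decode(target_next: TargetFn, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def speculative_decode(target_next: TargetFn, draft_block: DraftFn,
                       prompt: List[Token], max_new: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding with block drafting (toy sketch)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        draft = draft_block(out, k)  # one drafting call yields the whole block
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target contributes a bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the target model: a fixed arithmetic rule.
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    # Hypothetical drafter: follows the target's rule but errs after token 5.
    block, cur = [], ctx[-1]
    for _ in range(k):
        cur = 3 if cur == 5 else (cur * 3 + 1) % 7
        block.append(cur)
    return block


tokens, rounds = speculative_decode(toy_target, toy_draft, [1], max_new=12, k=4)
baseline = greedy_decode(toy_target, [1], max_new=12)
print(tokens == baseline, rounds)
```

In this sketch the number of verification rounds plays the role of the number of target forward passes, so a drafter that agrees with the target more often needs fewer rounds for the same output. That is the sense in which higher acceptance rates translate into speedup.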
