Fugu-MT Paper Translation (Abstract): Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Paper overview: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

arxiv url: http://arxiv.org/abs/2508.09192v1
Date: Fri, 08 Aug 2025 04:51:37 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.598132
Title: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Title (reference translation): Diffusion LLMs can perform faster-than-AR inference via discrete diffusion forcing
Authors: Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
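Abstract summary: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, yet no existing open-source dLLM has achieved faster inference than AR LLMs of similar size. This paper breaks that barrier with a simple and effective strategy called discrete diffusion forcing (D2F). In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference.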
Reference score (independently computed attention metric): 14.22753953706955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
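Abstract (reference translation): Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, no existing open-source dLLM has achieved faster inference than AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy called discrete diffusion forcing (D2F). D2F gives dLLMs two key capabilities: (1) block-wise autoregressive generation, which enables KV cache utilization; (2) prediction of subsequent tokens without waiting for prior blocks to complete, which enables inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. The paper further proposes a pipelined parallel decoding algorithm that enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs such as LLaDA and Dream, the acceleration can exceed $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.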
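
The abstract names block-wise autoregressive generation as the property that makes KV caching possible, but it does not spell out the attention pattern. One common way to realize such a scheme is a block-causal mask, in which a position attends to every position in its own block and in earlier blocks, and never to later blocks. The sketch below illustrates that general idea only; it is an assumption for illustration, not the D2F implementation, and the function name `block_causal_mask` and its parameters `seq_len` and `block_size` are hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical illustration: positions attend bidirectionally inside
    their own block and causally to every earlier block. Later blocks
    are hidden, so a finished block never depends on what follows it.
    """
    if seq_len <= 0 or block_size <= 0:
        raise ValueError("seq_len and block_size must be positive")
    # Index of the block that each position belongs to.
    block_of = [pos // block_size for pos in range(seq_len)]
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Print the mask for 6 positions split into blocks of 2.
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this kind of mask, the keys and values of a completed block do not change when later blocks are decoded, which is the condition a KV cache relies on. How D2F trains the model to predict later blocks before earlier ones are complete is not covered by this sketch.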

Related papers