Fugu-MT Paper Translation (Abstract): SimSD: Simple Speculative Decoding in Diffusion Language Models

Paper overview: SimSD: Simple Speculative Decoding in Diffusion Language Models

arxiv url: http://arxiv.org/abs/2606.02544v1
Date: Mon, 01 Jun 2026 17:46:46 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.54997
Title: SimSD: Simple Speculative Decoding in Diffusion Language Models
Authors: Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang, Haoru Li, Chaojie Ren, Yiming Huang, Kaijie Zhu, Zhongkai Yu, Kun Zhou, Jingbo Shang
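Abstract summary: Diffusion large language models (dLLMs) offer fast inference through parallel or blockwise decoding. Their masked language modeling formulation, however, remains incompatible with standard token-level speculative decoding. We propose SimSD, a speculative decoding algorithm for dLLMs that equips them with temporally valid token-level contexts. The method achieves up to 7.46x higher decoding throughput while maintaining average generation quality.
Reference score (independently computed attention score): 61.33773959352141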
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.
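
The abstract notes that in AR decoding a target model can verify multiple drafted tokens in a single forward pass. As a rough illustration of that verification step only, here is a minimal sketch of standard greedy draft verification. It is not the SimSD masking strategy, whose details the abstract does not give. The function names `argmax` and `verify_greedy` and the toy logits are assumptions made for this example, and the target logits are assumed to have been computed already.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Return the index of the largest value in one row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def verify_greedy(draft: List[int], target_logits: List[List[float]]) -> List[int]:
    """Greedy verification of drafted tokens (hypothetical helper).

    target_logits must have len(draft) + 1 rows. Row i holds the target
    model's scores for position i given the prompt and draft[:i]; all rows
    are assumed to come from a single forward pass of the target model.
    """
    if len(target_logits) != len(draft) + 1:
        raise ValueError("expected one logit row per drafted token plus one")

    accepted: List[int] = []
    for i, token in enumerate(draft):
        best = argmax(target_logits[i])
        if best != token:
            # First disagreement: keep the target's own token and stop.
            accepted.append(best)
            return accepted
        accepted.append(token)

    # Every drafted token matched, so the last row yields one extra token.
    accepted.append(argmax(target_logits[len(draft)]))
    return accepted


# Toy logits over a 3-token vocabulary (made-up numbers, not model output).
toy_logits = [
    [0.1, 0.2, 0.9],  # target prefers token 2
    [0.8, 0.1, 0.1],  # target prefers token 0
    [0.3, 0.6, 0.1],  # target prefers token 1
    [0.0, 0.1, 0.5],  # target prefers token 2
]
full_match = verify_greedy([2, 0, 1], toy_logits)
early_stop = verify_greedy([2, 1, 1], toy_logits)
```

In this sketch a fully matching draft yields every drafted token plus one extra target token, while a mismatch truncates the output at the target's correction. Sampling-based acceptance rules, which are also common in speculative decoding, are left out for brevity.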

Related papers