Fugu-MT Paper Translation (Abstract): WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Paper abstract: WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

arxiv url: http://arxiv.org/abs/2512.22737v1
Date: Sun, 28 Dec 2025 01:25:48 GMT
Status: Translation complete
Last updated in system: 2025-12-30 22:37:30.194876
Title: WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
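Title (reference translation): WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
Authors: Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou
Abstract summary: This paper proposes WeDLM, a diffusion decoding framework built on standard causal attention. WeDLM preserves the quality of strong AR backbones while delivering substantial speedups.
Reference score (independently computed attention metric): 44.87788417755154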
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
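Abstract (reference translation): Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step, yet in practice they often fail to convert this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). The main reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching, forces repeated contextualization, and undermines efficiency. We propose WeDLM, a diffusion decoding framework built on standard causal attention so that parallel generation becomes prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask; this is achieved by Topological Reordering, which moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes.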
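
The reordering idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `MASK`, `topological_reorder`, and `causal_visible` are hypothetical, and the example only shows how placing observed tokens first in physical order lets every masked slot see all of them under an ordinary causal mask. How the preserved logical positions are consumed by the model (for example, through its positional encoding) is an assumption not spelled out in the abstract.

```python
MASK = None  # hypothetical marker for a position that is still masked


def topological_reorder(tokens):
    """Put observed tokens first and masked slots last.

    Returns the physically reordered tokens plus, for each physical slot,
    the logical (original) position it came from.
    """
    observed = [i for i, t in enumerate(tokens) if t is not MASK]
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    order = observed + masked
    return [tokens[i] for i in order], order


def causal_visible(position_ids):
    """Logical positions each physical slot may attend to under a strict causal mask."""
    return [position_ids[: k + 1] for k in range(len(position_ids))]


tokens = ["The", MASK, "sat", MASK, "mat"]
physical, position_ids = topological_reorder(tokens)
visible = causal_visible(position_ids)

print(physical)      # observed tokens form the physical prefix
print(position_ids)  # logical positions are kept alongside each slot
print(visible[3])    # what the first masked slot can attend to
```

In this toy case the masked slot for logical position 1 sits after the observed tokens at logical positions 0, 2, and 4, so a plain causal mask already exposes all of them to it, and the observed prefix is the part a prefix KV cache could reuse.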


Related papers