Fugu-MT Paper Translation (Abstract): Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Paper overview: Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

arxiv url: http://arxiv.org/abs/2509.25188v2
Date: Fri, 03 Oct 2025 00:40:49 GMT
Status: Translation complete
Last updated in system: 2025-10-06 12:05:48.051544
Title: Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding
Title (reference translation): Parallel Learning: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding
Authors: Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang
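Abstract summary: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens. This paper proposes Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight, adaptive filter model to predict, for each token position, whether the current prediction matches the final output. The learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when they are correctly predicted.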
Reference score (independently calculated attention score): 21.609237262034636
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.
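The unmasking rules described in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the names `MASK`, `EOT`, `oracle_filter`, `threshold_filter`, `decode_step`, and `truncate_at_eot` are hypothetical, the token values and confidences are made up, and no filter model is trained here. The oracle rule needs the final output, which is unavailable at inference time; the learned filter in Learn2PD is described as approximating that rule. The truncation helper is only one plausible reading of End-of-Text Prediction.

```python
MASK = None      # hypothetical placeholder for a still-masked position
EOT = "<eot>"    # hypothetical end-of-text token


def oracle_filter(pred, final):
    """Oracle rule: select a position only if its prediction equals the final output."""
    return [p == f for p, f in zip(pred, final)]


def threshold_filter(conf, tau=0.9):
    """Fixed heuristic baseline: select a position when its confidence exceeds tau."""
    return [c > tau for c in conf]


def decode_step(seq, pred, keep):
    """Commit predictions at positions that are still masked and selected by keep."""
    return [p if (s is MASK and k) else s for s, p, k in zip(seq, pred, keep)]


def truncate_at_eot(seq):
    """Assumed EoTP-style cut: drop everything after the first committed EOT token."""
    return seq[: seq.index(EOT) + 1] if EOT in seq else seq


# Toy data (illustrative only): position 1 is confidently wrong,
# position 2 is correct but has low confidence.
final = ["a", "b", "c", EOT, EOT, EOT]
pred = ["a", "x", "c", EOT, EOT, EOT]
conf = [0.95, 0.92, 0.60, 0.99, 0.99, 0.99]
seq = [MASK] * 6

oracle_out = decode_step(seq, pred, oracle_filter(pred, final))
thresh_out = decode_step(seq, pred, threshold_filter(conf))
trimmed = truncate_at_eot(oracle_out)

print(oracle_out)  # position 1 stays masked
print(thresh_out)  # position 1 is committed as "x", position 2 stays masked
print(trimmed)     # trailing EOT padding is dropped
```

In this toy case the fixed threshold commits a wrong token and holds back a correct one, while the oracle rule does neither; that gap is what an input-adaptive filter is meant to close.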
