Fugu-MT Paper Translation (Abstract): MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

Paper overview: MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

arxiv url: http://arxiv.org/abs/2603.16077v1
Date: Tue, 17 Mar 2026 02:54:16 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.077785
Title: MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models
Title (reference translation): MDM-Prime-v2: Binary Encoding and Index Shuffling That Enable Compute-optimal Scaling of Diffusion Language Models
Authors: Chen-Hao Chao, Wei-Fang Sun, Junwei Qua, Chun-Yi Lee, Rahul G. Krishnan
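Abstract summary: Masked diffusion models (MDM) exhibit superior generalization when trained with a partial masking scheme (Prime). The authors develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. In compute-optimal comparisons, MDM-Prime-v2 achieves a perplexity of 7.77 on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41).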
Reference score (independently computed attention score): 26.967863200265494
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. Second, we find that the function form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8$\times$ more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.
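Abstract (reference translation): Masked diffusion models (MDM) exhibit better generalization when trained with a partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. The MDM-Prime framework has two limitations. First, there are no tools to guide the choice of the token-granularity hyperparameter in the subtokenizer. Second, we find that the functional form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our scaling analysis shows that MDM-Prime-v2 is 21.8$\times$ more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves a perplexity of 7.77 on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When the model size is extended to 1.1B parameters, the model further shows superior zero-shot accuracy on various commonsense reasoning tasks.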
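
Illustrative sketch (not from the paper): the abstract says tokens are converted into sub-tokens and that MDM-Prime-v2 uses Binary Encoding and Index Shuffling, but it does not give the construction. The Python below is one possible reading, under loudly stated assumptions: "index shuffling" is taken to mean a fixed random permutation of token ids, and "binary encoding" is taken to mean writing the shuffled id as a fixed-length bit sequence. The function name `make_subtokenizer`, the `seed` parameter, and the vocabulary size 50257 (the GPT-2 BPE vocabulary size, used here only as an example) are hypothetical choices, not the authors' implementation.

```python
import random


def make_subtokenizer(vocab_size, seed=0):
    """Build a hypothetical token <-> bit-sequence mapping.

    Assumption: token ids are first permuted by a fixed random shuffle,
    then each shuffled id is written as a fixed-length list of bits.
    This is an illustration only, not the method from the paper.
    """
    # Number of bits needed to represent ids 0 .. vocab_size - 1.
    num_bits = max(1, (vocab_size - 1).bit_length())

    # Fixed permutation of token ids (the assumed "index shuffling").
    rng = random.Random(seed)
    perm = list(range(vocab_size))
    rng.shuffle(perm)

    # Inverse permutation, so decoding can recover the original id.
    inverse = [0] * vocab_size
    for token_id, shuffled in enumerate(perm):
        inverse[shuffled] = token_id

    def encode(token_id):
        """Map a token id to num_bits sub-tokens, most significant bit first."""
        shuffled = perm[token_id]
        return [(shuffled >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]

    def decode(bits):
        """Map a bit sequence back to a token id.

        Raises IndexError for bit patterns whose value is >= vocab_size,
        since those do not correspond to any token.
        """
        shuffled = 0
        for bit in bits:
            shuffled = (shuffled << 1) | bit
        return inverse[shuffled]

    return encode, decode, num_bits


# Usage: round-trip one token id through the sub-token representation.
encode, decode, num_bits = make_subtokenizer(50257)
bits = encode(1234)
restored = decode(bits)
```

Under these assumptions each token becomes `num_bits` binary sub-tokens, so a diffusion process at the sub-token level could mask individual bits rather than whole tokens; whether this matches the paper's design would need to be checked against the full text.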


Related papers