Fugu-MT Paper Translation (Summary): Scaling Beyond Masked Diffusion Language Models

Paper summary: Scaling Beyond Masked Diffusion Language Models

arxiv url: http://arxiv.org/abs/2602.15014v1
Date: Mon, 16 Feb 2026 18:54:47 GMT
Status: Translation complete
Last updated in system: 2026-02-17 16:22:50.639674
Title: Scaling Beyond Masked Diffusion Language Models
Authors: Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic,
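Abstract summary: This paper presents the first scaling law study of uniform-state and interpolating discrete diffusion methods. It also shows that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective.
Reference score (attention score computed independently by this site): 18.68471174706656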
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms

List of related papers