Fugu-MT Paper Translation (Summary): Sumi: Open Uniform Diffusion Language Model from Scratch

Paper summary: Sumi: Open Uniform Diffusion Language Model from Scratch

arxiv url: http://arxiv.org/abs/2606.19005v1
Date: Wed, 17 Jun 2026 12:32:46 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.162565
Title: Sumi: Open Uniform Diffusion Language Model from Scratch
Authors: Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki
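Abstract summary: Sumi is a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. The authors release the model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora.
Reference score (attention metric computed independently by this site): 13.559605580540293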
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
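The abstract contrasts uniform diffusion with masked diffusion but does not spell out the difference. The toy sketch below illustrates the general idea of the two forward (noising) processes as commonly described in the discrete diffusion literature; it is not Sumi's training code. The function names, the use of the noise level `t` directly as a per-token corruption probability, and the tiny vocabulary are all assumptions made for illustration.

```python
import random


def uniform_corrupt(tokens, t, vocab_size, rng):
    """Uniform noising (illustrative): each position is independently replaced,
    with probability t, by a token drawn uniformly from the vocabulary.
    A corrupted position still holds an ordinary token, so a denoiser may
    need to revise any position at any step."""
    return [rng.randrange(vocab_size) if rng.random() < t else tok
            for tok in tokens]


def masked_corrupt(tokens, t, mask_id, rng):
    """Masked (absorbing) noising (illustrative): each position is independently
    replaced, with probability t, by a dedicated mask id. Corrupted positions
    are explicitly marked, so a denoiser only has to fill in the masks."""
    return [mask_id if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    vocab_size = 50          # hypothetical vocabulary: ids 0..49
    mask_id = vocab_size     # hypothetical mask id outside the vocabulary
    clean = [5, 17, 3, 42, 8, 23]

    for t in (0.0, 0.5, 1.0):
        print(f"t={t}")
        print("  uniform:", uniform_corrupt(clean, t, vocab_size, rng))
        print("  masked: ", masked_corrupt(clean, t, mask_id, rng))
```

At `t = 0` both functions return the clean sequence. At `t = 1` the masked variant yields only mask ids, while the uniform variant yields random vocabulary tokens with no marker showing which positions were changed. That missing marker is why a uniform diffusion model must be able to update any token at any step.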

