Fugu-MT Paper Translation (Abstract): Nucleus-Image: Sparse MoE for Image Generation

Paper abstract: Nucleus-Image: Sparse MoE for Image Generation

arxiv url: http://arxiv.org/abs/2604.12163v1
Date: Tue, 14 Apr 2026 00:43:23 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.174804
Title: Nucleus-Image: Sparse MoE for Image Generation
Authors: Chandan Akiti, Ajay Modukuri, Murali Nandan Nagarapu, Gunavardhan Akiti, Haozhe Liu,
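Abstract summary: We propose a text-to-image generation model that matches or exceeds leading models on GenEval, DPG-Bench, and OneIG-Bench. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images.
Reference score (independently computed attention score): 5.769753912757775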
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
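The Expert-Choice Routing mentioned in the abstract can be illustrated with a minimal sketch. This is the generic form of the technique (each expert picks its top-k tokens, rather than each token picking experts), not the paper's implementation: the function names, the softmax-over-experts affinity, and the capacity factor values are assumptions made for illustration, and the decoupled timestep-aware routing described in the abstract is not modeled here.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def expert_choice_route(logits, capacity_factor):
    """Generic expert-choice routing (hypothetical helper, not the paper's code).

    logits[t][e] is the router logit of token t for expert e.
    Returns, for each expert, a list of (token_index, gate_weight) pairs.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    # Every expert takes the same number of tokens, so load is balanced
    # by construction. On average a token is seen by `capacity_factor` experts.
    k = max(1, int(num_tokens * capacity_factor / num_experts))
    k = min(k, num_tokens)
    # Token-to-expert affinities: softmax over experts for each token.
    probs = [softmax(row) for row in logits]
    assignment = []
    for e in range(num_experts):
        # The expert does the choosing: rank tokens by affinity, keep the top k.
        ranked = sorted(range(num_tokens), key=lambda t: probs[t][e], reverse=True)
        assignment.append([(t, probs[t][e]) for t in ranked[:k]])
    return assignment


# Toy usage: 256 image tokens, 64 routed experts (the expert count is from
# the abstract; the token count and capacity factor are assumed values).
rng = random.Random(0)
toy_logits = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(256)]
toy_routes = expert_choice_route(toy_logits, capacity_factor=2.0)
print(len(toy_routes), "experts,", len(toy_routes[0]), "tokens per expert")
```

Lowering `capacity_factor` over the course of training makes each expert accept fewer tokens, which is one plausible reading of the "progressive sparsification of the expert capacity factor" in the abstract. Note that under this scheme some tokens may be selected by several experts and others by none.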

Related papers