Fugu-MT Paper Translation (Abstract): LLaDA-MoE: A Sparse MoE Diffusion Language Model

Paper overview: LLaDA-MoE: A Sparse MoE Diffusion Language Model

arxiv url: http://arxiv.org/abs/2509.24389v1
Date: Mon, 29 Sep 2025 07:38:59 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.823588
Title: LLaDA-MoE: A Sparse MoE Diffusion Language Model
Title (reference translation): LLaDA-MoE: Sparse MoE Diffusion Language Model
Authors: Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen
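Abstract summary: LLaDA-MoE is a large language diffusion model with a Mixture-of-Experts (MoE) architecture. It achieves competitive performance with significantly reduced computational overhead. The results suggest that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths.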
Reference score (independently computed attention score): 88.96960440635992
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.
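Abstract (reference translation): We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. By keeping a 7B-parameter capacity while activating only 1.4B parameters during inference, LLaDA-MoE achieves competitive performance at significantly reduced computational overhead. Our evaluation against LLaDA, LLaDA 1.5, and Dream on multiple benchmarks shows that LLaDA-MoE reaches state-of-the-art performance among diffusion language models, including ones with more parameters. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct shows capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks, despite using fewer active parameters. These results suggest that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and leaves ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.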
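To make the "7B capacity, 1.4B active" idea concrete, the following is a minimal sketch of generic top-k expert routing in plain Python. It is not the LLaDA-MoE implementation: the abstract does not state the number of experts, the value of k, or the router design, so the four scalar "experts", k=2, and the function names `route_top_k` and `moe_layer` are all hypothetical choices for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalised so that the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k):
    """Combine the outputs of the selected experts only.

    Experts that are not selected are never called, which is where the
    compute saving of a sparse MoE comes from.
    """
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy setup: 4 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_layer(1.0, experts, router_logits=[0.0, 0.0, 5.0, 5.0], k=2)

# Figures taken from the abstract: 7B total parameters, 1.4B active at
# inference, i.e. roughly one fifth of the parameters are used per token.
active_fraction = 1.4 / 7.0
```

In this toy setup only experts 2 and 3 run for the given input; a real MoE layer would apply the same selection per token inside each transformer block.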

Related papers