Fugu-MT Paper Translation (Abstract): Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

Paper Summary: Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

arxiv url: http://arxiv.org/abs/2509.26520v1
Date: Tue, 30 Sep 2025 16:56:44 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.217347
Title: Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization
Authors: Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su
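Abstract summary: Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.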
Reference score (independently computed attention metric): 60.309915093470416
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard training strategy of Top-K router prevents MoE models from realizing their full potential for elastic inference. When the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, but at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
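The abstract describes varying the number of activated experts during training, with a layer-wise randomization strategy reported as the most effective. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation. The function names `top_k_route` and `sample_layerwise_k`, the candidate counts `[1, 2, 4]`, uniform sampling of the expert count, and the renormalized softmax over the selected experts are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def top_k_route(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1. The abstract does not say how weights are normalized.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Rank experts by router score, highest first. The set chosen for a
    # smaller k is always a prefix of the set chosen for a larger k,
    # which gives the nested, coarse-to-fine ordering.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(logits[i] for i in chosen)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sample_layerwise_k(num_layers, k_choices, rng):
    """Draw an independent expert count for each MoE layer for one step.

    Assumption: counts are sampled uniformly from k_choices.
    """
    return [rng.choice(k_choices) for _ in range(num_layers)]


if __name__ == "__main__":
    rng = random.Random(0)
    router_logits = [0.2, 1.5, -0.3, 0.9]  # hypothetical scores for 4 experts
    for layer, k in enumerate(sample_layerwise_k(3, [1, 2, 4], rng)):
        print(f"layer {layer}: k={k}, weights={top_k_route(router_logits, k)}")
```

At inference time the same routing function can be called with a fixed expert count per layer, which is where the elasticity described in the abstract would come from: a smaller count uses only the top-ranked experts, and a larger count adds the later ones.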

Related papers