Fugu-MT Paper Translation (Summary): Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Paper summary: Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

arxiv url: http://arxiv.org/abs/2508.18672v1
Date: Tue, 26 Aug 2025 04:31:28 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.680924
Title: Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
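Abstract summary: Mixture-of-Experts (MoE) models are now standard in state-of-the-art systems. The paper investigates how MoE sparsity influences two distinct capability regimes: memorization and reasoning.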
Reference score (independently computed attention score): 17.067788440109137
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
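The distinction the abstract draws between total parameters, active parameters, and top-$k$ routing can be illustrated with a small parameter-accounting sketch. This is not the authors' code: the `MoELayer` class, its field names, the two-matrix expert layout, and the definition of sparsity as the fraction of experts left unused per token are all assumptions made for illustration, and router parameters are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayer:
    """Hypothetical MoE feed-forward layer; every name here is illustrative."""
    d_model: int      # hidden size
    d_ff: int         # per-expert feed-forward width
    num_experts: int  # number of experts in the layer
    top_k: int        # experts each token is routed to

    def expert_params(self) -> int:
        # Assumed layout: one up projection and one down projection, no biases.
        return 2 * self.d_model * self.d_ff

    def total_params(self) -> int:
        # Parameters stored across all experts.
        return self.num_experts * self.expert_params()

    def active_params(self) -> int:
        # Parameters actually used for a single token.
        return self.top_k * self.expert_params()

    def sparsity(self) -> float:
        # Assumed definition: fraction of experts NOT used per token.
        return 1.0 - self.top_k / self.num_experts


# Baseline configuration.
base = MoELayer(d_model=512, d_ff=2048, num_experts=16, top_k=2)
# More experts: total parameters grow, active parameters stay fixed.
wider = MoELayer(d_model=512, d_ff=2048, num_experts=64, top_k=2)
# Larger top-k with narrower experts: total and active parameters both unchanged.
finer = MoELayer(d_model=512, d_ff=1024, num_experts=32, top_k=4)

for name, layer in [("base", base), ("wider", wider), ("finer", finer)]:
    print(name, layer.total_params(), layer.active_params(), layer.sparsity())
```

Under these assumptions, `wider` is the kind of configuration that raises total parameters (and sparsity) at constant active parameters, while `finer` changes only top-$k$, the setting the abstract reports as having little effect.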


Related papers