Fugu-MT Paper Translation (Abstract): Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Paper summary: Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

arxiv url: http://arxiv.org/abs/2508.18672v2
Date: Thu, 25 Sep 2025 14:09:33 GMT
Status: Translation complete
System update date: 2025-09-26 14:16:56.015451
Title: Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
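Abstract summary: Mixture-of-Experts (MoE) models are now standard in state-of-the-art systems. The paper investigates how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills.
Reference score (attention score computed independently by this site): 17.067788440109137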
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
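As a rough illustration of the quantities named in the abstract, the sketch below computes total and active parameters, tokens per parameter (TPP), and an approximate training FLOP count for a made-up MoE configuration. It is not the authors' code. The class and field names are hypothetical, sparsity is assumed to mean the fraction of experts left idle per token (1 - top_k / num_experts), and the FLOP estimate assumes the common 6 x parameters x tokens rule of thumb applied to active parameters.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE description; every field name here is illustrative."""
    shared_params: float      # parameters every token uses (attention, embeddings, ...)
    params_per_expert: float  # parameters of one expert, summed over all MoE layers
    num_experts: int          # experts per MoE layer
    top_k: int                # experts each token is routed to

    @property
    def total_params(self) -> float:
        # All experts count toward the total, whether or not a token uses them.
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self) -> float:
        # Only the top-k routed experts contribute compute for a given token.
        return self.shared_params + self.top_k * self.params_per_expert

    @property
    def sparsity(self) -> float:
        # Assumed definition: fraction of experts left idle per token.
        return 1.0 - self.top_k / self.num_experts


def tokens_per_parameter(train_tokens: float, cfg: MoEConfig) -> float:
    """TPP: training tokens divided by total (not active) parameters."""
    return train_tokens / cfg.total_params


def approx_training_flops(train_tokens: float, cfg: MoEConfig) -> float:
    """Rule-of-thumb estimate (assumption): 6 * active parameters * tokens."""
    return 6.0 * cfg.active_params * train_tokens


if __name__ == "__main__":
    # Made-up numbers, chosen only to make the arithmetic easy to follow.
    cfg = MoEConfig(shared_params=1e8, params_per_expert=5e7, num_experts=16, top_k=2)
    tokens = 1.8e10
    print("total params :", cfg.total_params)
    print("active params:", cfg.active_params)
    print("sparsity     :", cfg.sparsity)
    print("TPP          :", tokens_per_parameter(tokens, cfg))
    print("train FLOPs  :", approx_training_flops(tokens, cfg))
```

Raising `num_experts` while holding `top_k` and the token count fixed leaves the FLOP estimate unchanged but lowers TPP, which is the trade-off the abstract says matters for reasoning tasks.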
