Fugu-MT Paper Translation (Abstract): Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Paper overview: Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

arxiv url: http://arxiv.org/abs/2603.10379v1
Date: Wed, 11 Mar 2026 03:49:04 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.770139
Title: Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
Title (reference translation): Optimal Allocation of Expert-Attention Compute in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
Authors: Junzhuo Li, Peijie Jiang, Changxin Tian, Jia Liu, Zhiqiang Zhang, Xuming Hu
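Abstract summary: Mixture-of-Experts (MoE) models have emerged as an efficient way to scale model capacity without a proportional increase in computation. The ratio $r$ is defined as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers. The analysis yields an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation.
Reference score (attention metric computed independently by this site): 37.14769075463234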
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.
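Abstract (reference translation): This paper presents a new extension of neural scaling laws to Mixture-of-Experts (MoE) models. Because MoE architectures have emerged as a way to scale model capacity efficiently without a proportional increase in computation, determining the optimal expert-attention compute ratio is important. We define $r$ as the fraction of total FLOPs per token devoted to the expert layers versus the attention layers, and examine how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we find empirically that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis yields an explicit formula for $r^*$, allowing precise control over the expert-attention compute allocation. By incorporating this architectural parameter, we generalize the Chinchilla scaling law and provide a new framework for tuning MoE models beyond size and data. These findings provide practical guidelines for designing efficient MoE models and optimizing performance while respecting a fixed compute budget.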
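To make the ratio $r$ from the abstract concrete, the following is a minimal sketch of how per-token FLOPs might be split between the attention and expert sub-layers of a single block. Everything here is an assumption for illustration: the class and function names are hypothetical, the FLOP counts use the common approximation of 2 FLOPs per multiply-accumulate, each expert is assumed to be a two-matrix MLP, router cost is ignored, and $r$ is read as expert FLOPs divided by attention FLOPs. The abstract does not state the explicit formula for $r^*$, so it is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class MoELayerShape:
    """Hypothetical shape parameters for one GPT-style MoE Transformer block."""
    d_model: int  # hidden width
    d_ff: int     # inner width of each expert MLP
    top_k: int    # number of experts activated per token
    seq_len: int  # context length, used for the attention-score term


def attention_flops_per_token(s: MoELayerShape) -> int:
    # Q, K, V, and output projections: four d_model x d_model matrices,
    # counted as 2 FLOPs per multiply-accumulate.
    projections = 2 * 4 * s.d_model * s.d_model
    # QK^T scores plus the weighted sum over values,
    # each roughly 2 * seq_len * d_model FLOPs per token.
    scores = 2 * 2 * s.seq_len * s.d_model
    return projections + scores


def expert_flops_per_token(s: MoELayerShape) -> int:
    # Assumption: each active expert is an up-projection and a down-projection,
    # i.e. two d_model x d_ff matrices at 2 FLOPs per multiply-accumulate.
    return s.top_k * 2 * 2 * s.d_model * s.d_ff


def expert_attention_ratio(s: MoELayerShape) -> float:
    # Assumed reading of r: expert FLOPs divided by attention FLOPs.
    return expert_flops_per_token(s) / attention_flops_per_token(s)


if __name__ == "__main__":
    shape = MoELayerShape(d_model=1024, d_ff=4096, top_k=2, seq_len=2048)
    print(expert_attention_ratio(shape))
```

With these illustrative values the attention sub-layer costs 16,777,216 FLOPs per token and the two active experts cost 33,554,432, so the sketch reports a ratio of 2.0. Under a different reading of $r$ (for example, expert FLOPs as a share of the combined total), only `expert_attention_ratio` would need to change.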

Related papers