Fugu-MT Paper Translation (Abstract): Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

Paper summary: Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

arxiv url: http://arxiv.org/abs/2509.25913v1
Date: Tue, 30 Sep 2025 08:04:02 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.061349
Title: Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka
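Abstract summary: Mixture-of-Experts (MoE) has become a cornerstone of recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert outputs. As an alternative to $\mathrm{Softmax}$, the authors propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN).
Reference score (independently computed attention score): 87.60286115014833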
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function. Comprehensive experiments on MoE and LLMs validate the effectiveness of the proposed FFN-style router function $\mathrm{KERN}$.
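The correspondence described in the abstract can be illustrated with a small sketch. The abstract does not state the exact KERN formula, so the code below is an assumption: it applies ReLU to linear router scores and then divides by their $\ell_2$ norm, and compares this with a standard Softmax gate. All function names (`softmax`, `relu_l2`, `nadaraya_watson`, `moe_output`), the Gaussian kernel, and the toy scalar experts are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Standard Softmax gate: maps raw router scores onto the probability simplex."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_l2(scores, eps=1e-12):
    """Assumed KERN-style gate: ReLU on the scores, then l2-normalization.

    This is an illustrative guess at the recommended configuration,
    not the paper's exact definition.
    """
    pos = [max(0.0, s) for s in scores]
    norm = math.sqrt(sum(p * p for p in pos))
    return [p / (norm + eps) for p in pos]


def nadaraya_watson(x, xs, ys, bandwidth=1.0):
    """Classical Nadaraya-Watson estimate for scalar inputs.

    Returns sum_i K(x, x_i) * y_i / sum_j K(x, x_j) with a Gaussian kernel
    (the kernel choice is an assumption made for this sketch).
    """
    weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2)) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


def moe_output(x, router_rows, experts, gate):
    """Weighted sum of expert outputs, with weights from a gate over linear scores.

    Structurally this mirrors the Nadaraya-Watson form: gate weights play the
    role of kernel weights, and expert outputs play the role of the targets y_i.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_rows]
    weights = gate(scores)
    outputs = [expert(x) for expert in experts]
    return sum(w * o for w, o in zip(weights, outputs)), weights


if __name__ == "__main__":
    x = [1.0, -2.0]
    router_rows = [[0.5, 0.1], [-0.3, 0.2], [0.2, -0.4]]  # one row per expert
    experts = [lambda v: sum(v), lambda v: v[0] * v[1], lambda v: max(v)]

    for name, gate in (("softmax", softmax), ("relu_l2", relu_l2)):
        y, w = moe_output(x, router_rows, experts, gate)
        print(name, "weights:", w, "output:", y)
```

Under these assumptions, the Softmax gate always assigns every expert a positive weight that sums to one, whereas the ReLU plus $\ell_2$ gate assigns exactly zero weight to experts with negative scores and yields a weight vector of unit $\ell_2$ norm rather than unit sum. Both gates reuse the same linear router scores, which is consistent with the "zero-additional-cost" description, although the paper's actual implementation may differ.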

Related papers