Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
- URL: http://arxiv.org/abs/2509.25913v2
- Date: Tue, 14 Oct 2025 07:04:14 GMT
- Title: Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
- Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka,
- Abstract summary: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output. We propose the \textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN) as an alternative to $\mathrm{Softmax}$.
- Score: 87.60286115014833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. \textbf{Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function.} Comprehensive experiments on MoE and LLMs validate the effectiveness of the proposed FFN-style router function $\mathrm{KERN}$.
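To make the routing contrast concrete, below is a minimal NumPy sketch of a standard Softmax router versus a KERN-style router that applies $\mathrm{ReLU}$ activation followed by $\ell_2$-normalization, as recommended in the abstract. Only that activation-plus-normalization choice is taken from the abstract; the expert parameterization, the dense (non-top-k) aggregation, and the epsilon term are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: Softmax routing vs. a KERN-style router (ReLU + l2-normalization).
# Shapes, dense aggregation, and the eps term are illustrative assumptions.
import numpy as np

def softmax_router(logits: np.ndarray) -> np.ndarray:
    """Standard router: project scores onto the probability simplex."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

def kern_style_router(logits: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """KERN-style router sketch: ReLU activation, then l2-normalization.

    Unlike Softmax, negatively scored experts receive an exact zero weight,
    and the surviving weights are normalized to unit l2 norm, not unit sum.
    """
    w = np.maximum(logits, 0.0)                       # ReLU
    norm = np.linalg.norm(w, axis=-1, keepdims=True)  # per-token l2 norm
    return w / (norm + eps)

def moe_output(x, router_weights, expert_mats, router_fn):
    """Nadaraya-Watson view: output = sum_i weight_i(x) * expert_i(x)."""
    weights = router_fn(router_weights @ x)                        # (E,)
    expert_outs = np.stack([W @ x for W in expert_mats], axis=-1)  # (d, E)
    return expert_outs @ weights

rng = np.random.default_rng(0)
d, num_experts = 3, 4
router_weights = rng.normal(size=(num_experts, d))
expert_mats = [rng.normal(size=(d, d)) for _ in range(num_experts)]
x = rng.normal(size=d)

print("Softmax weights:   ", np.round(softmax_router(router_weights @ x), 3))
print("KERN-style weights:", np.round(kern_style_router(router_weights @ x), 3))
print("MoE output (KERN-style routing):",
      np.round(moe_output(x, router_weights, expert_mats, kern_style_router), 3))
```

In this toy example the Softmax weights are strictly positive and sum to one, while the ReLU-plus-normalization weights are sparse, with exact zeros for negatively scored experts, at no additional parameter or FLOP cost over the Softmax variant.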
Related papers
- Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets [9.94514344279733]
Arithmetic-Mean $\mu$P constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L) \propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N \gg k$.
arXiv Detail & Related papers (2025-10-05T19:22:50Z) - VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations [49.1574468325115]
We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. The proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm{\mu}_{\bm{\phi}_y}$. A fully connected neural network is used to map $\bm{\mu}_{\bm{\phi}_y}$ to the latent space, $\bm{\mu}_{\bm{\phi}_h}$, of the P
arXiv Detail & Related papers (2025-08-05T18:37:32Z) - ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [28.73697327316267]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing. ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
arXiv Detail & Related papers (2024-12-19T10:21:20Z) - Enhanced Feature Learning via Regularisation: Integrating Neural Networks and Kernel Methods [0.0]
We consider functions as expectations of Sobolev functions over all possible one-dimensional projections of the data. This framework is similar to kernel ridge regression, where the kernel is $\mathbb{E}_w\big(k^{(B)}(w^\top x, w^\top x')\big)$, with $k^{(B)}(a,b) := \min(|a|, |b|)\,\mathds{1}_{ab>0}$ the Brownian kernel, and the distribution of the projections $w$ is learnt (a minimal evaluation sketch of this kernel appears after this list).
arXiv Detail & Related papers (2024-07-24T13:46:50Z) - A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + Ax, \mathbf{1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$.
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z) - Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that patch-level MoE (pMoE) provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
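As referenced in the Enhanced Feature Learning entry above, the following short sketch evaluates the projected Brownian kernel $\mathbb{E}_w\big(k^{(B)}(w^\top x, w^\top x')\big)$ by Monte Carlo. Drawing the projections $w$ from an isotropic Gaussian is an illustrative assumption; that paper learns the distribution of the projections rather than fixing it.

```python
# Sketch of the projected Brownian kernel E_w[k_B(w^T x, w^T x')], where
# k_B(a, b) = min(|a|, |b|) * 1[ab > 0]. The Gaussian projections and the
# Monte Carlo estimate are illustrative assumptions.
import numpy as np

def brownian_kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """k_B(a, b) = min(|a|, |b|) when a and b share a sign, else 0."""
    return np.minimum(np.abs(a), np.abs(b)) * (a * b > 0)

def projected_kernel(x, x_prime, num_projections=10_000, seed=0):
    """Monte Carlo estimate of E_w[k_B(w^T x, w^T x')] with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(num_projections, x.shape[0]))
    return brownian_kernel(w @ x, w @ x_prime).mean()

x = np.array([1.0, 0.5, -0.2])
x_prime = np.array([0.8, 0.1, 0.3])
print(round(projected_kernel(x, x_prime), 4))
```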
This list is automatically generated from the titles and abstracts of the papers in this site.