Rethinking Multinomial Logistic Mixture of Experts with Sigmoid Gating Function
- URL: http://arxiv.org/abs/2602.01466v1
- Date: Sun, 01 Feb 2026 22:19:16 GMT
- Title: Rethinking Multinomial Logistic Mixture of Experts with Sigmoid Gating Function
- Authors: Tuan Minh Pham, Thinh Cao, Viet Nguyen, Huy Nguyen, Nhat Ho, Alessandro Rinaldo
- Abstract summary: We show that the sigmoid gate exhibits a lower sample complexity than the softmax gate for both parameter and expert estimation. We find that incorporating a temperature into the sigmoid gate leads to a sample complexity of exponential order due to an intrinsic interaction between the temperature and gating parameters.
- Score: 84.47276999832135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sigmoid gate in mixture-of-experts (MoE) models has been empirically shown to outperform the softmax gate across several tasks, ranging from approximating feed-forward networks to language modeling. Additionally, recent efforts have demonstrated that the sigmoid gate is provably more sample-efficient than its softmax counterpart under regression settings. Nevertheless, there are three notable concerns that have not been addressed in the literature, namely (i) the benefits of the sigmoid gate have not been established under classification settings; (ii) existing sigmoid-gated MoE models may not converge to their ground-truth; and (iii) the effects of a temperature parameter in the sigmoid gate remain theoretically underexplored. To tackle these open problems, we perform a comprehensive analysis of multinomial logistic MoE equipped with a modified sigmoid gate to ensure model convergence. Our results indicate that the sigmoid gate exhibits a lower sample complexity than the softmax gate for both parameter and expert estimation. Furthermore, we find that incorporating a temperature into the sigmoid gate leads to a sample complexity of exponential order due to an intrinsic interaction between the temperature and gating parameters. To overcome this issue, we propose replacing the vanilla inner product score in the gating function with a Euclidean score that effectively removes that interaction, thereby substantially improving the sample complexity to a polynomial order.
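To make the objects in the abstract concrete, here is a minimal Python sketch of a multinomial logistic MoE with a temperature-scaled sigmoid gate. It is an illustration under stated assumptions, not the authors' formulation: the function names, the parameter layout, the exact form of the Euclidean score, and the normalization of the gate weights (used here only so the output is a valid class distribution) are all hypothetical, and the paper's "modified sigmoid gate" is not specified in this listing.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    # Shift by the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inner_product_score(x, w, b, tau):
    # Vanilla score: (w . x + b) / tau. The temperature tau divides the
    # gating parameters, so tau and (w, b) enter the score jointly.
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / tau


def euclidean_score(x, w, b, tau):
    # ASSUMED form of a Euclidean score: -||x - w|| / tau + b.
    # The abstract only says a Euclidean score replaces the inner product.
    dist = math.sqrt(sum((xi - wi) ** 2 for wi, xi in zip(w, x)))
    return -dist / tau + b


def moe_class_probs(x, gates, experts, tau, score_fn):
    # gates: list of (w, b); experts: list of (A, c), where A holds one
    # weight row per class and c holds one bias per class.
    weights = [sigmoid(score_fn(x, w, b, tau)) for (w, b) in gates]
    total = sum(weights)
    weights = [g / total for g in weights]  # ASSUMPTION: normalized gates
    num_classes = len(experts[0][1])
    probs = [0.0] * num_classes
    for g, (A, c) in zip(weights, experts):
        logits = [sum(a * xi for a, xi in zip(row, x)) + ci
                  for row, ci in zip(A, c)]
        expert_probs = softmax(logits)  # multinomial logistic expert
        for k in range(num_classes):
            probs[k] += g * expert_probs[k]
    return probs


# Hypothetical toy parameters: 2 experts, 3 classes, 2-dimensional input.
x = [0.5, -1.0]
gates = [([1.0, 0.0], 0.1), ([0.0, 1.0], -0.2)]
experts = [
    ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]),
    ([[-1.0, 0.5], [0.3, 0.3], [0.0, -1.0]], [0.1, 0.0, -0.1]),
]
print(moe_class_probs(x, gates, experts, 0.5, inner_product_score))
print(moe_class_probs(x, gates, experts, 0.5, euclidean_score))
```

Passing the score function as an argument keeps the gate and the experts unchanged while the score is swapped, which mirrors the comparison the abstract draws between the inner product and Euclidean scores. The sketch says nothing about sample complexity; it only shows where the temperature enters each score.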
Related papers
- Sharp Convergence Rates for Masked Diffusion Models [53.117058231393834]
We develop a total-variation based analysis for the Euler method that overcomes limitations. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS.
arXiv Detail & Related papers (2026-02-26T00:47:51Z)
- Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity [49.809923981964715]
Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. In this work, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. We also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal.
arXiv Detail & Related papers (2026-01-31T23:45:50Z)
- Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps [41.371172458797524]
Non-identifiability of gating parameters up to common translations, intrinsic gate-expert interactions, and tight numerator-denominator coupling are addressed. For model selection, we adapt dendrogram-guided SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains optimal parameter rates. On a dataset of drought-identifiable maize traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps.
arXiv Detail & Related papers (2025-10-14T17:23:44Z)
- Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts [78.3687645289918]
We show that the sigmoid gating function enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation.
We find that experts formulated as feed-forward networks with commonly used activation such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating.
arXiv Detail & Related papers (2024-05-22T21:12:34Z)
- Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts? [27.924615931679757]
We explore the impacts of a dense-to-sparse gating mixture of experts (MoE) on the maximum likelihood estimation under the MoE.
We propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering it to the softmax function.
arXiv Detail & Related papers (2024-01-25T01:09:09Z)
- A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts [28.13187489224953]
We propose a novel class of modified softmax gating functions which transform the input before delivering it to the gating functions.
As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
arXiv Detail & Related papers (2023-10-22T05:32:19Z)
- Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts [40.24720443257405]
We provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model.
Our findings reveal that the MLE has distinct behaviors under two complementary settings of location parameters of the Gaussian gating functions.
Notably, these behaviors can be characterized by the solvability of two different systems of equations.
arXiv Detail & Related papers (2023-05-12T16:02:19Z)
- Accurate methods for the analysis of strong-drive effects in parametric gates [94.70553167084388]
We show how to efficiently extract gate parameters using exact numerics and a perturbative analytical approach.
We identify optimal regimes of operation for different types of gates including $i$SWAP, controlled-Z, and CNOT.
arXiv Detail & Related papers (2021-07-06T02:02:54Z)
- A Rigorous Link Between Self-Organizing Maps and Gaussian Mixture Models [78.6363825307044]
This work presents a mathematical treatment of the relation between Self-Organizing Maps (SOMs) and Gaussian Mixture Models (GMMs).
We show that energy-based SOM models can be interpreted as performing gradient descent.
This link allows SOMs to be treated as generative probabilistic models, giving a formal justification for using SOMs to detect outliers or to sample.
arXiv Detail & Related papers (2020-09-24T14:09:04Z)