Related papers: Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

URL: http://arxiv.org/abs/2405.13997v3
Date: Sat, 02 Nov 2024 17:49:00 GMT
Title: Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo
Abstract summary: We show that the sigmoid gating function enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. We find that experts formulated as feed-forward networks with commonly used activation such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating.
Score: 78.3687645289918
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in current literature. In this paper, we verify theoretically that the sigmoid gating, in fact, enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator under the over-specified case in which the number of fitted experts is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activation such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating than those under softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
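Illustration: the contrast between the two gating functions described in the abstract can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' code: the scalar input, the affine gate scores, the one-unit ReLU experts, and all names (softmax_gate, sigmoid_gate, relu_expert, moe_predict) and parameter values are hypothetical and chosen for illustration.

```python
import math


def softmax_gate(scores):
    """Softmax weights: they sum to 1, so experts compete for mass."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gate(scores):
    """Sigmoid weights: each lies in (0, 1) and is computed independently."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


def relu_expert(x, a, b):
    """Hypothetical one-unit feed-forward expert with ReLU activation."""
    return max(0.0, a * x + b)


def moe_predict(x, gate_params, expert_params, gate):
    """Regression function: gate-weighted sum of expert outputs."""
    scores = [w * x + c for (w, c) in gate_params]
    weights = gate(scores)
    outputs = [relu_expert(x, a, b) for (a, b) in expert_params]
    return sum(g * o for g, o in zip(weights, outputs))


# Assumed toy parameters: (slope, intercept) per gate and per expert.
gate_params = [(1.0, 0.0), (-1.0, 0.5)]
expert_params = [(2.0, 1.0), (0.5, -0.2)]

x = 0.3
print("softmax MoE:", moe_predict(x, gate_params, expert_params, softmax_gate))
print("sigmoid MoE:", moe_predict(x, gate_params, expert_params, sigmoid_gate))
```

In this sketch, raising one expert's score lowers every other softmax weight, whereas each sigmoid weight depends only on its own score. That coupling is the competition among experts that the abstract refers to.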

Related papers

On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts [66.39976432286905]
We study the convergence rates of the maximum likelihood estimator of gating and prompt parameters. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model.
arXiv Detail & Related papers (2025-05-24T01:30:46Z)
Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective [69.72942835553228]
This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than its softmax counterpart. We show that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
arXiv Detail & Related papers (2025-02-01T02:36:14Z)
Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
On Least Square Estimation in Softmax Gating Mixture of Experts [78.3687645289918]
We investigate the performance of the least squares estimators (LSE) under a deterministic MoE model. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. Our findings have important practical implications for expert selection.
arXiv Detail & Related papers (2024-02-05T12:31:18Z)
Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts? [27.924615931679757]
We explore the impacts of a dense-to-sparse gating mixture of experts (MoE) on the maximum likelihood estimation under the MoE. We propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering them to the softmax function.
arXiv Detail & Related papers (2024-01-25T01:09:09Z)
Inverse Reinforcement Learning with Sub-optimal Experts [56.553106680769474]
We study the theoretical properties of the class of reward functions that are compatible with a given set of experts. Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards. We analyze a uniform sampling algorithm that results in being minimax optimal whenever the sub-optimal experts' performance level is sufficiently close to the one of the optimal agent.
arXiv Detail & Related papers (2024-01-08T12:39:25Z)
In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer [27.025808709031864]
It has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring. We show that the cause of the miscalibrated and unbounded estimator in prior literature is due to the symmetric nature of the surrogate losses used and not due to softmax. We propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness.
arXiv Detail & Related papers (2023-11-02T09:15:52Z)
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts [28.13187489224953]
We propose a novel class of modified softmax gating functions which transform the input before delivering them to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
arXiv Detail & Related papers (2023-10-22T05:32:19Z)
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks. We propose M-SMoE, which leverages routing statistics to guide expert merging. Our MC-SMoE achieves up to 80% memory and a 20% FLOPs reduction, with virtually no loss in performance.
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts [28.907764868329988]
We study the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. Our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells.
arXiv Detail & Related papers (2023-09-25T03:28:01Z)
Deep Hierarchy in Bandits [51.22833900944146]
Mean rewards of actions are often correlated. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model.
arXiv Detail & Related papers (2022-02-03T08:15:53Z)
Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.