Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity
- URL: http://arxiv.org/abs/2602.00939v1
- Date: Sat, 31 Jan 2026 23:45:50 GMT
- Title: Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity
- Authors: Fanqi Yan, Dung Le, Trang Pham, Huy Nguyen, Nhat Ho
- Abstract summary: Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. In this work, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. We also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal.
- Score: 49.809923981964715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. Despite recent efforts to analyze the convergence behavior of parameter estimation in this model, two problems remain unresolved in the literature. First, the contaminated MoE model has been studied solely in regression settings, while its theoretical foundation in classification settings remains absent. Second, previous works on MoE models for classification capture pointwise convergence rates for parameter estimation without any guarantee of minimax optimality. In this work, we close these gaps by performing, for the first time, a convergence analysis of a contaminated mixture of multinomial logistic experts with homogeneous and heterogeneous expert structures. In each regime, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. Furthermore, we establish corresponding minimax lower bounds showing that these rates are minimax optimal. Notably, our theory offers an important insight into the design of contaminated MoE: expert heterogeneity yields faster parameter estimation rates and is therefore more sample-efficient than expert homogeneity.
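To make the model concrete, here is a minimal sketch of a contaminated mixture of multinomial logistic experts, assuming the conditional class probabilities take the form p(y|x) = (1 - lam) * softmax(W0 x)_y + lam * softmax(W1 x)_y, with a frozen pre-trained expert W0 and a trainable adapter expert W1. The names W0, W1, and lam are illustrative and not the paper's notation.

```python
# Minimal sketch of a contaminated mixture of multinomial logistic experts.
# Assumed form: p(y|x) = (1 - lam) * softmax(W0 x)_y + lam * softmax(W1 x)_y.
# W0, W1, lam are hypothetical names, not the paper's notation.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contaminated_moe_probs(x, W0, W1, lam):
    """Class probabilities of the contaminated MoE for one input x.

    x   : (d,) feature vector
    W0  : (K, d) frozen pre-trained expert (never updated)
    W1  : (K, d) trainable adapter expert
    lam : contamination weight in (0, 1), also trainable
    """
    p_frozen  = softmax(W0 @ x)  # frozen multinomial logistic expert
    p_adapter = softmax(W1 @ x)  # trainable multinomial logistic expert
    return (1.0 - lam) * p_frozen + lam * p_adapter  # convex mixture, sums to 1

# Usage: d = 5 features, K = 3 classes
rng = np.random.default_rng(0)
x  = rng.normal(size=5)
W0 = rng.normal(size=(3, 5))   # stands in for a pre-trained model
W1 = rng.normal(size=(3, 5))   # adapter to be estimated from data
print(contaminated_moe_probs(x, W0, W1, lam=0.3))
```

During fine-tuning, only W1 and lam would be updated while W0 stays frozen, mirroring the adapter-on-pretrained-model setup the abstract describes.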
Related papers
- Sharp Convergence Rates for Masked Diffusion Models [53.117058231393834]
We develop a total-variation-based analysis for the Euler method that overcomes limitations of prior analyses. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS.
arXiv Detail & Related papers (2026-02-26T00:47:51Z)
- Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models [40.216463162163976]
We develop a batch minorization-maximization algorithm for softmax-gated multinomial-logistic MoE. We also prove finite-sample rates for conditional density estimation and parameter recovery. Experiments on biological protein-protein interaction prediction validate the full pipeline.
arXiv Detail & Related papers (2026-02-08T14:45:41Z)
- Rethinking Multinomial Logistic Mixture of Experts with Sigmoid Gating Function [84.47276999832135]
We show that the sigmoid gate exhibits a lower sample complexity than the softmax gate for both parameter and expert estimation. We find that incorporating a temperature into the sigmoid gate leads to a sample complexity of exponential order due to an intrinsic interaction between the temperature and gating parameters.
arXiv Detail & Related papers (2026-02-01T22:19:16Z)
- Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning [56.87989363424]
We show that a low-rank structure naturally emerges in the shifted successor measure. We quantify the amount of shift needed for effective low-rank approximation and estimation.
arXiv Detail & Related papers (2025-09-05T15:48:20Z)
- On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts [66.39976432286905]
We study the convergence rates of the maximum likelihood estimator of gating and prompt parameters. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model.
arXiv Detail & Related papers (2025-05-24T01:30:46Z)
- Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts [24.665178287368974]
We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt-learning problem, where one uses prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for downstream tasks.
arXiv Detail & Related papers (2024-10-16T05:52:51Z)
- Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts [78.3687645289918]
We show that the sigmoid gating function enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation; a schematic contrast of the two gating functions is sketched after this list.
We find that experts formulated as feed-forward networks with commonly used activations such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating.
arXiv Detail & Related papers (2024-05-22T21:12:34Z)
- A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts [28.13187489224953]
We propose a novel class of modified softmax gating functions which transform the input before delivering it to the gating functions.
As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
arXiv Detail & Related papers (2023-10-22T05:32:19Z)
- Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts [40.24720443257405]
We provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model.
Our findings reveal that the MLE has distinct behaviors under two complementary settings of location parameters of the Gaussian gating functions.
Notably, these behaviors can be characterized by the solvability of two different systems of equations.
arXiv Detail & Related papers (2023-05-12T16:02:19Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
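Two entries above ("Rethinking Multinomial Logistic Mixture of Experts with Sigmoid Gating Function" and "Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts") contrast the two gating functions. The sketch below illustrates only the structural difference, assuming a two-expert gate with hypothetical parameters a0 and a1 (not the notation of any listed paper): softmax couples the expert weights through a shared normalizer, whereas sigmoid scores each expert independently.

```python
# Schematic contrast of softmax vs. sigmoid gating for a two-expert MoE.
# a0, a1 are hypothetical gate parameters; this is a sketch, not the papers' models.
import numpy as np

def softmax_gating(x, a0, a1):
    """Softmax gate: weights are coupled through a shared normalizer."""
    scores = np.array([a0 @ x, a1 @ x])
    e = np.exp(scores - scores.max())      # subtract max for numerical stability
    return e / e.sum()                     # coupled weights, always sum to 1

def sigmoid_gating(x, a0, a1):
    """Sigmoid gate: each expert is scored independently."""
    scores = np.array([a0 @ x, a1 @ x])
    return 1.0 / (1.0 + np.exp(-scores))   # per-expert weights, no shared normalizer

rng = np.random.default_rng(1)
x, a0, a1 = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
print(softmax_gating(x, a0, a1))   # two coupled weights summing to 1
print(sigmoid_gating(x, a0, a1))   # two independent weights in (0, 1)
```

One way to read the cited sample-efficiency results: under softmax gating, perturbing one gate parameter moves every expert's weight through the shared normalizer, while sigmoid gating decouples the weights expert by expert.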