Mixture of Many Zero-Compute Experts: A High-Rate Quantization Theory Perspective
- URL: http://arxiv.org/abs/2510.03151v1
- Date: Fri, 03 Oct 2025 16:24:50 GMT
- Title: Mixture of Many Zero-Compute Experts: A High-Rate Quantization Theory Perspective
- Authors: Yehuda Dar
- Abstract summary: This paper uses classical high-rate quantization theory to provide new insights into mixture-of-experts (MoE) models for regression tasks. Our MoE is defined by a segmentation of the input space into regions, each with a single-parameter expert that acts as a constant predictor. We show how the tradeoff between approximation and estimation errors in MoE learning depends on the number of experts.
- Score: 0.6345523830122167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper uses classical high-rate quantization theory to provide new insights into mixture-of-experts (MoE) models for regression tasks. Our MoE is defined by a segmentation of the input space into regions, each with a single-parameter expert that acts as a constant predictor requiring zero compute at inference. Motivated by the assumptions of high-rate quantization theory, we assume that the number of experts is large enough that their input-space regions are very small. This lets us study the approximation error of our MoE model class: (i) for one-dimensional inputs, we formulate the test error and its minimizing segmentation and experts; (ii) for multidimensional inputs, we formulate an upper bound on the test error and study its minimization. Moreover, we consider the learning of the expert parameters from a training dataset, given an input-space segmentation, and formulate their statistical learning properties. This leads us to show, theoretically and empirically, how the tradeoff between approximation and estimation errors in MoE learning depends on the number of experts.
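To make the model class concrete, here is a minimal sketch of such a zero-compute MoE for one-dimensional regression, under two assumptions of mine that the abstract does not fix: a uniform segmentation of [0, 1] (the paper also derives error-minimizing segmentations) and experts fit as the mean of the training targets in their region (the least-squares constant predictor). All function names are illustrative.

```python
# A minimal sketch of a zero-compute MoE: the input space [0, 1] is split
# into num_experts equal-width regions (an assumption; the paper also
# studies optimized segmentations), each region's single-parameter expert
# is the mean of the training targets that fall in it, and inference is a
# pure lookup with no per-expert computation.
import numpy as np

def fit_zero_compute_moe(x_train, y_train, num_experts, lo=0.0, hi=1.0):
    """Fit one constant expert per uniform-width region of [lo, hi]."""
    edges = np.linspace(lo, hi, num_experts + 1)
    # Assign each training input to its region (i.e., its expert).
    idx = np.clip(np.searchsorted(edges, x_train, side="right") - 1,
                  0, num_experts - 1)
    params = np.zeros(num_experts)
    for k in range(num_experts):
        mask = idx == k
        # Least-squares constant predictor: the mean of targets in the region.
        params[k] = y_train[mask].mean() if mask.any() else 0.0
    return edges, params

def predict(edges, params, x):
    """Zero-compute inference: route the input, return the stored constant."""
    k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(params) - 1)
    return params[k]

# Toy usage: more experts shrink the approximation error but raise the
# estimation error once each region sees few training points -- the
# tradeoff the paper analyzes.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
edges, params = fit_zero_compute_moe(x, y, num_experts=64)
print(predict(edges, params, 0.25))  # should be near sin(pi/2) = 1
```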
Related papers
- Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity [49.809923981964715]
Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. In this work, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. We also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal.
arXiv Detail & Related papers (2026-01-31T23:45:50Z) - On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts [66.39976432286905]
We study the convergence rates of the maximum likelihood estimator of gating and prompt parameters. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model.
arXiv Detail & Related papers (2025-05-24T01:30:46Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and then dynamically assigns the experts their respective weights (a generic sketch of such gating appears after this list). We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts [24.665178287368974]
We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem, where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks.
arXiv Detail & Related papers (2024-10-16T05:52:51Z) - SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage prunes the total number of experts using heavy-hitters counting as guidance (a hedged sketch of such usage-count pruning appears after this list), while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy lost to pruning.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z) - Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number of experts, or even just one, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - On Least Square Estimation in Softmax Gating Mixture of Experts [78.3687645289918]
We investigate the performance of the least squares estimators (LSE) under a deterministic MoE model.
We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions.
Our findings have important practical implications for expert selection.
arXiv Detail & Related papers (2024-02-05T12:31:18Z)
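As referenced in the softmax-gating entry above, here is a generic sketch of dense softmax gating for an MoE layer. The gating parameterization (`gate_weights`) and the expert functions are illustrative assumptions, not the exact models analyzed in those papers.

```python
# A generic sketch of dense softmax gating: a linear router scores each
# expert's relevance to the input, a softmax turns scores into weights,
# and the output is the gate-weighted combination of expert outputs.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_weights, experts):
    """x: (d,) input; gate_weights: (num_experts, d); experts: list of callables."""
    logits = gate_weights @ x                  # one relevance score per expert
    gates = softmax(logits)                    # adaptive softmax gating
    outputs = np.stack([f(x) for f in experts])
    return gates @ outputs                     # gate-weighted combination

# Toy usage with two scalar-output experts on a 3-dimensional input.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
W_g = rng.normal(size=(2, 3))
experts = [lambda v: v.sum(), lambda v: v.mean()]
print(moe_forward(x, W_g, experts))
```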
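And, as referenced in the SEER-MoE entry, a hedged sketch of the heavy-hitters idea in its first stage: rank experts by how often the router selects them on calibration data and keep only the most-used ones. The top-k routing statistic and the keep ratio here are my assumptions, not SEER-MoE's exact procedure.

```python
# A hedged sketch of usage-count ("heavy hitters") expert pruning: run a
# calibration pass, count how often each expert lands in a token's top-k
# routing choices, and keep the most frequently used experts.
import numpy as np

def heavy_hitter_prune(router_logits, keep_ratio=0.5, top_k=2):
    """router_logits: (num_tokens, num_experts) from a calibration pass."""
    num_experts = router_logits.shape[1]
    # Indices of each token's top-k experts, then per-expert usage counts.
    topk_idx = np.argsort(router_logits, axis=1)[:, -top_k:]
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts)
    num_keep = max(1, int(keep_ratio * num_experts))
    kept = np.sort(np.argsort(counts)[-num_keep:])  # most-used experts
    return kept, counts

# Toy usage: 1000 calibration tokens routed over 8 experts, with a bias
# that makes higher-index experts more popular.
rng = np.random.default_rng(2)
logits = rng.normal(size=(1000, 8)) + np.linspace(0, 1, 8)
kept, counts = heavy_hitter_prune(logits, keep_ratio=0.5)
print(kept, counts)
```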