Mean-field limit from general mixtures of experts to quantum neural networks
- URL: http://arxiv.org/abs/2501.14660v1
- Date: Fri, 24 Jan 2025 17:29:41 GMT
- Title: Mean-field limit from general mixtures of experts to quantum neural networks
- Authors: Anderson Melchor Hernandez, Davide Pastorello, Giacomo De Palma
- Abstract summary: We study the behavior of Mixture of Experts (MoE) trained via gradient flow on supervised learning problems. Our main result establishes the propagation of chaos for a MoE as the number of experts diverges.
- Score: 3.7498611358320733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we study the asymptotic behavior of Mixture of Experts (MoE) trained via gradient flow on supervised learning problems. Our main result establishes the propagation of chaos for a MoE as the number of experts diverges. We demonstrate that the corresponding empirical measure of their parameters is close to a probability measure that solves a nonlinear continuity equation, and we provide an explicit convergence rate that depends solely on the number of experts. We apply our results to a MoE generated by a quantum neural network.
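To make the mean-field statement above concrete, the display below sketches the generic form such results take: finitely many experts trained by gradient flow, their empirical parameter measure, and a limiting nonlinear continuity equation. The notation (parameters \theta_i^N(t), risk R, empirical measure \hat{\mu}_t^N, velocity field v[\mu_t]) is illustrative only and is not taken from the paper.

```latex
% Illustrative sketch only; the notation is generic, not the paper's exact statement.
% N experts with parameters \theta_i^N(t), trained by gradient flow on a risk R:
\[
  \dot{\theta}_i^N(t) = -\nabla_{\theta} R\bigl(\theta_i^N(t); \hat{\mu}_t^N\bigr),
  \qquad
  \hat{\mu}_t^N = \frac{1}{N}\sum_{i=1}^{N} \delta_{\theta_i^N(t)} .
\]
% Propagation of chaos: as N grows, the empirical measure \hat{\mu}_t^N stays close to
% the solution \mu_t of a nonlinear continuity equation with measure-dependent velocity v[\mu_t]:
\[
  \partial_t \mu_t + \nabla_{\theta} \cdot \bigl( \mu_t \, v[\mu_t] \bigr) = 0 ,
\]
% with an explicit convergence rate that depends only on the number of experts N.
```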
Related papers
- On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts [66.39976432286905]
We study the convergence rates of the maximum likelihood estimator of gating and prompt parameters. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model.
arXiv Detail & Related papers (2025-05-24T01:30:46Z) - Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures [24.865197779389323]
Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning. We introduce a novel extension to Gaussian-gated MoE models that enables consistent estimation of the true number of mixture components. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts.
arXiv Detail & Related papers (2025-05-19T12:41:19Z) - Avoided-crossings, degeneracies and Berry phases in the spectrum of quantum noise through analytic Bloch-Messiah decomposition [49.1574468325115]
"analytic Bloch-Messiah decomposition" provides approach for characterizing dynamics of quantum optical systems.
We show that avoided crossings arise naturally when a single parameter is varied, leading to hypersensitivity of the singular vectors.
We highlight the possibility of programming the spectral response of photonic systems through the deliberate design of avoided crossings.
arXiv Detail & Related papers (2025-04-29T13:14:15Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models.
Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and dynamically assigns each expert its weight (a minimal gating sketch appears after this list).
We perform a convergence analysis of parameter estimation and expert estimation for MoE equipped with standard softmax gating or its variants, including dense-to-sparse gating and hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts [24.665178287368974]
We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts.
This model is motivated by the prompt learning problem, where one uses prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for downstream tasks.
arXiv Detail & Related papers (2024-10-16T05:52:51Z) - Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number of experts, or even just one, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z) - On Least Square Estimation in Softmax Gating Mixture of Experts [78.3687645289918]
We investigate the performance of the least squares estimators (LSE) under a deterministic MoE model.
We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions.
Our findings have important practical implications for expert selection.
arXiv Detail & Related papers (2024-02-05T12:31:18Z) - Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts? [27.924615931679757]
We explore the impact of a dense-to-sparse gate on maximum likelihood estimation under the mixture of experts (MoE).
We propose a novel activation dense-to-sparse gate, which routes the output of a linear layer through an activation function before delivering it to the softmax function.
arXiv Detail & Related papers (2024-01-25T01:09:09Z) - Towards Convergence Rates for Parameter Estimation in Gaussian-gated
Mixture of Experts [40.24720443257405]
We provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model.
Our findings reveal that the MLE has distinct behaviors under two complementary settings of the location parameters of the Gaussian gating functions.
Notably, these behaviors can be characterized by the solvability of two different systems of equations.
arXiv Detail & Related papers (2023-05-12T16:02:19Z) - Demystifying Softmax Gating Function in Gaussian Mixture of Experts [34.53974702114644]
We propose novel Voronoi loss functions over the parameter space and establish the convergence rates of the maximum likelihood estimator (MLE) for parameter estimation.
Our findings show a connection between the convergence rate of the MLE and a solvability problem of a system of equations.
arXiv Detail & Related papers (2023-05-05T05:37:55Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts and uses a gated routing network to activate experts conditionally.
However, as the number of experts grows, an MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z) - Momentum Diminishes the Effect of Spectral Bias in Physics-Informed
Neural Networks [72.09574528342732]
Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs).
They often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias.
In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM).
arXiv Detail & Related papers (2022-06-29T19:03:10Z) - Bayesian Uncertainty Estimation of Learned Variational MRI
Reconstruction [63.202627467245584]
We introduce a Bayesian variational framework to quantify the model-immanent (epistemic) uncertainty.
We demonstrate that our approach yields competitive results for undersampled MRI reconstruction.
arXiv Detail & Related papers (2021-02-12T18:08:14Z)