Related papers: Learning Mixtures of Experts with EM: A Mirror Descent Perspective

Learning Mixtures of Experts with EM: A Mirror Descent Perspective

URL: http://arxiv.org/abs/2411.06056v2
Date: Fri, 23 May 2025 20:53:09 GMT
Title: Learning Mixtures of Experts with EM: A Mirror Descent Perspective
Authors: Quentin Fruytier, Aryan Mokhtari, Sujay Sanghavi,
Abstract summary: Classical Mixtures of Experts (MoE) are Machine Learning models that involve the input space, with a separate "expert" model trained on each partition.<n>We study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models.
Score: 28.48469221248906
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Classical Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent-type methods on the log-likelihood. In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable conditioned on the feature variable belongs to an exponential family of distributions and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence; In the special case of mixture of $2$ linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data supports that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.

Related papers

Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm [5.625796693054093]
We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models.<n>We demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance.
arXiv Detail & Related papers (2025-06-13T14:57:57Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Wasserstein Convergence of Score-based Generative Models under Semiconvexity and Discontinuous Gradients [0.0]
Score-based Generative Models (SGMs) approximate a data distribution by perturbing it with Gaussian noise and subsequently denoising it via a learned diffusion process.<n>We establish the first non-asymotic Wasserstein-2 convergence guarantees for SGMs targeting semi-one order with potentially discontinuous gradients.
arXiv Detail & Related papers (2025-05-06T11:17:15Z)
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective.<n>The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning.<n>The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z)
Network EM Algorithm for Gaussian Mixture Model in Decentralized Federated Learning [1.4549461207028445]
We study various network Expectation-Maximization (EM) algorithms for the Gaussian mixture model. We introduce a momentum network EM (MNEM) algorithm, which uses a momentum parameter to combine information from both the current and historical estimators. We also develop a semi-supervised MNEM algorithm, which leverages partially labeled data.
arXiv Detail & Related papers (2024-11-08T14:25:46Z)
Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space [72.52365911990935]
We introduce Bellman Diffusion, a novel DGM framework that maintains linearity in MDPs through gradient and scalar field modeling. Our results show that Bellman Diffusion achieves accurate field estimations and is a capable image generator, converging 1.5x faster than the traditional histogram-based baseline in distributional RL tasks.
arXiv Detail & Related papers (2024-10-02T17:53:23Z)
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
On Least Square Estimation in Softmax Gating Mixture of Experts [78.3687645289918]
We investigate the performance of the least squares estimators (LSE) under a deterministic MoE model. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. Our findings have important practical implications for expert selection.
arXiv Detail & Related papers (2024-02-05T12:31:18Z)
Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference. Online VSMC is capable of performing efficiently, entirely on-the-fly, both parameter estimation and particle proposal adaptation.
arXiv Detail & Related papers (2023-12-19T21:45:38Z)
Efficient Training of Energy-Based Models Using Jarzynski Equality [13.636994997309307]
Energy-based models (EBMs) are generative models inspired by statistical physics. The computation of its gradient with respect to the model parameters requires sampling the model distribution. Here we show how results for nonequilibrium thermodynamics based on Jarzynski equality can be used to perform this computation efficiently.
arXiv Detail & Related papers (2023-05-30T21:07:52Z)
The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models [37.6314945221565]
Deep networks are known to generalize well to unseen data. Regularization properties ensure interpolating solutions with "good" properties are found. We present simulation results that validate the theory and introduce two data models.
arXiv Detail & Related papers (2023-02-18T22:23:42Z)
Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain. We show that an unbiased estimator can be obtained via simple matching the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)
Stochastic Mirror Descent in Average Ensemble Models [38.38572705720122]
The mirror descent (SMD) is a general class of training algorithms, which includes the celebrated gradient descent (SGD) as a special case. In this paper we explore the performance of the mirror potential algorithm on mean-field ensemble models.
arXiv Detail & Related papers (2022-10-27T11:04:00Z)
Improvements to Supervised EM Learning of Shared Kernel Models by Feature Space Partitioning [0.0]
This paper addresses the lack of rigour in the derivation of the EM training algorithm and the computational complexity of the technique. We first present a detailed derivation of EM for the Gaussian shared kernel model PRBF classifier. To reduce complexity of the resulting SKEM algorithm, we partition the feature space into $R$ non-overlapping subsets of variables.
arXiv Detail & Related papers (2022-05-31T09:18:58Z)
Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs) We present Efficient Ensemble of Experts (E$3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues. We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders. We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. We prove exacts characterising the estimator in high-dimensions via empirical risk minimisation. We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z)
Training Deep Energy-Based Models with f-Divergence Minimization [113.97274898282343]
Deep energy-based models (EBMs) are very flexible in distribution parametrization but computationally challenging. We propose a general variational framework termed f-EBM to train EBMs using any desired f-divergence. Experimental results demonstrate the superiority of f-EBM over contrastive divergence, as well as the benefits of training EBMs using f-divergences other than KL.
arXiv Detail & Related papers (2020-03-06T23:11:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.