Related papers: Learning Mixtures of Experts with EM

Equivariant Evidential Deep Learning for Interatomic Potentials [55.6997213490859]
Uncertainty quantification (UQ) is critical for assessing the reliability of machine learning interatomic potentials (MLIPs) in molecular dynamics simulations. Existing UQ approaches for MLIPs are often limited by high computational cost or suboptimal performance. We propose Equivariant Evidential Deep Learning for Interatomic Potentials ($\text{e}^2$IP), a backbone-agnostic framework that models atomic forces and their uncertainty jointly.
arXiv Detail & Related papers (2026-02-11T02:00:25Z)
Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity [49.809923981964715]
Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. In this work, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. We also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal.
arXiv Detail & Related papers (2026-01-31T23:45:50Z)
Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization [0.0]
We study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance.
arXiv Detail & Related papers (2026-01-21T14:22:25Z)
Cluster-Based Generalized Additive Models Informed by Random Fourier Features [19.409397281817288]
This work introduces a mixture of generalized additive models (GAMs) in which random Fourier feature (RFF) representations are leveraged to uncover locally adaptive structure in the data. Numerical experiments on real-world regression benchmarks, including the California Housing, NASA Air Self-Noise, and Bike Sharing datasets, demonstrate improved predictive performance.
arXiv Detail & Related papers (2025-12-22T13:15:52Z)
Nonparametric Data Attribution for Diffusion Models [57.820618036556084]
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images.
arXiv Detail & Related papers (2025-10-16T03:37:16Z)
Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm [5.625796693054093]
We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models. We demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance.
arXiv Detail & Related papers (2025-06-13T14:57:57Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Wasserstein Convergence of Score-based Generative Models under Semiconvexity and Discontinuous Gradients [0.0]
Score-based Generative Models (SGMs) approximate a data distribution by perturbing it with Gaussian noise and subsequently denoising it via a learned diffusion process. We establish the first non-asymptotic Wasserstein-2 convergence guarantees for SGMs targeting semiconvex distributions with potentially discontinuous gradients.
arXiv Detail & Related papers (2025-05-06T11:17:15Z)
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z)
Network EM Algorithm for Gaussian Mixture Model in Decentralized Federated Learning [1.4549461207028445]
We study various network Expectation-Maximization (EM) algorithms for the Gaussian mixture model. We introduce a momentum network EM (MNEM) algorithm, which uses a momentum parameter to combine information from both the current and historical estimators. We also develop a semi-supervised MNEM algorithm, which leverages partially labeled data.
arXiv Detail & Related papers (2024-11-08T14:25:46Z)
Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space [72.52365911990935]
We introduce Bellman Diffusion, a novel DGM framework that maintains linearity in MDPs through gradient and scalar field modeling. Our results show that Bellman Diffusion achieves accurate field estimations and is a capable image generator, converging 1.5x faster than the traditional histogram-based baseline in distributional RL tasks.
arXiv Detail & Related papers (2024-10-02T17:53:23Z)
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
On Least Square Estimation in Softmax Gating Mixture of Experts [78.3687645289918]
We investigate the performance of the least squares estimators (LSE) under a deterministic MoE model. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. Our findings have important practical implications for expert selection.
arXiv Detail & Related papers (2024-02-05T12:31:18Z)
Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference. Online VSMC is capable of performing efficiently, entirely on-the-fly, both parameter estimation and particle proposal adaptation.
arXiv Detail & Related papers (2023-12-19T21:45:38Z)
Efficient Training of Energy-Based Models Using Jarzynski Equality [13.636994997309307]
Energy-based models (EBMs) are generative models inspired by statistical physics. The computation of its gradient with respect to the model parameters requires sampling the model distribution. Here we show how results for nonequilibrium thermodynamics based on Jarzynski equality can be used to perform this computation efficiently.
arXiv Detail & Related papers (2023-05-30T21:07:52Z)
The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models [37.6314945221565]
Deep networks are known to generalize well to unseen data. Regularization properties ensure interpolating solutions with "good" properties are found. We present simulation results that validate the theory and introduce two data models.
arXiv Detail & Related papers (2023-02-18T22:23:42Z)
Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain. We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)
Stochastic Mirror Descent in Average Ensemble Models [38.38572705720122]
Stochastic mirror descent (SMD) is a general class of training algorithms that includes the celebrated stochastic gradient descent (SGD) as a special case. In this paper we explore the performance of the mirror potential algorithm on mean-field ensemble models.
arXiv Detail & Related papers (2022-10-27T11:04:00Z)
Improvements to Supervised EM Learning of Shared Kernel Models by Feature Space Partitioning [0.0]
This paper addresses the lack of rigour in the derivation of the EM training algorithm and the computational complexity of the technique. We first present a detailed derivation of EM for the Gaussian shared kernel model PRBF classifier. To reduce complexity of the resulting SKEM algorithm, we partition the feature space into $R$ non-overlapping subsets of variables.
arXiv Detail & Related papers (2022-05-31T09:18:58Z)
Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues. We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders. We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. We prove exact asymptotics characterising the estimator in high dimensions via empirical risk minimisation. We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z)
Training Deep Energy-Based Models with f-Divergence Minimization [113.97274898282343]
Deep energy-based models (EBMs) are very flexible in distribution parametrization but computationally challenging. We propose a general variational framework termed f-EBM to train EBMs using any desired f-divergence. Experimental results demonstrate the superiority of f-EBM over contrastive divergence, as well as the benefits of training EBMs using f-divergences other than KL.
arXiv Detail & Related papers (2020-03-06T23:11:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.