Learning large softmax mixtures with warm start EM
- URL: http://arxiv.org/abs/2409.09903v2
- Date: Sun, 03 Aug 2025 01:32:47 GMT
- Title: Learning large softmax mixtures with warm start EM
- Authors: Xin Bing, Florentina Bunea, Jonathan Niles-Weed, Marten Wegkamp
- Abstract summary: Softmax mixture models (SMMs) are discrete $K$-mixtures introduced to model the probability of choosing an attribute $x_j \in \RR^L$ from $p$ candidates. This paper provides a comprehensive analysis of the EM algorithm for SMMs in high dimensions.
- Score: 17.081578976570437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Softmax mixture models (SMMs) are discrete $K$-mixtures introduced to model the probability of choosing an attribute $x_j \in \RR^L$ from $p$ candidates in heterogeneous populations. They are known as mixed multinomial logits in the econometrics literature, and are gaining traction in the LLM literature, where single softmax models are routinely used in the final layer of a neural network. This paper provides a comprehensive analysis of the EM algorithm for SMMs in high dimensions. Its population-level theoretical analysis forms the basis for proving (i) local identifiability in SMMs with generic features and, further, via a stochastic argument, (ii) full identifiability in SMMs with random features, when $p$ is large enough. These are the first results in this direction for SMMs with $L > 1$. The population-level EM analysis characterizes the initialization radius required for algorithmic convergence, which in turn guides the construction of warm starts for the sample-level EM. Under suitable initialization, the EM algorithm is shown to recover the mixture atoms of the SMM at a near-parametric rate. We provide two main directions for warm-start construction, both based on a new method for estimating the moments of the mixing measure underlying an SMM with random design. First, we construct a method of moments (MoM) estimator of the mixture parameters and provide its first theoretical analysis. While MoM can enjoy parametric rates of convergence, and thus can serve as a warm start, the estimator's quality degrades exponentially in $K$. Our recommendation, when $K$ is not small, is to run the EM algorithm several times with random initializations. We again make use of the novel latent moments estimation method to estimate the $K$-dimensional subspace spanned by the mixture atoms; sampling from this subspace substantially reduces the number of required draws.
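A minimal, illustrative sketch of the plain EM iteration for a softmax mixture (mixed multinomial logit) may help fix the setting: each observation consists of $p$ candidate feature vectors in $\RR^L$ and one observed choice, and the mixture is parameterized by atoms $\beta_1, \dots, \beta_K$ and weights $\alpha$. The sketch below does not implement the paper's latent-moment estimation or warm-start construction; the M-step uses a few gradient steps because it has no closed form, and the names `em_softmax_mixture`, `B0`, `alpha0`, and the learning rate `lr` are hypothetical.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def em_softmax_mixture(X, y, B0, alpha0, n_iter=50, m_steps=5, lr=0.1):
    """Plain EM for a K-component softmax mixture (mixed multinomial logit).

    X  : (n, p, L) array, p candidate feature vectors per observation
    y  : (n,) indices of the chosen candidates
    B0 : (K, L) initial atoms (warm start); alpha0 : (K,) initial weights
    """
    n, p, L = X.shape
    B, alpha = B0.astype(float).copy(), alpha0.astype(float).copy()
    onehot = np.zeros((n, p))
    onehot[np.arange(n), y] = 1.0
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to alpha_k * P_k(y_i | X_i)
        probs = softmax(np.einsum('npl,kl->nkp', X, B))   # (n, K, p)
        lik = probs[np.arange(n), :, y]                   # (n, K)
        r = alpha * lik
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form weight update, then a few gradient steps on the atoms
        alpha = r.mean(axis=0)
        for _ in range(m_steps):
            probs = softmax(np.einsum('npl,kl->nkp', X, B))
            # gradient of the responsibility-weighted log-likelihood w.r.t. each atom
            grad = np.einsum('nk,nkp,npl->kl', r, onehot[:, None, :] - probs, X)
            B = B + lr * grad / n
    return B, alpha
```

In the paper's pipeline, the warm start `B0` would come either from the method-of-moments estimator or from random draws inside the estimated $K$-dimensional subspace of the atoms; the sketch simply takes it as given.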
Related papers
- Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm [5.625796693054093]
We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models. We demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance.
arXiv Detail & Related papers (2025-06-13T14:57:57Z) - Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixtures [53.51230405648361]
We study the dynamics of gradient EM and employ tensor decomposition to characterize the geometric landscape of the likelihood loss. This is the first global convergence and recovery result for EM or gradient EM beyond the special case of $m=2$.
arXiv Detail & Related papers (2025-06-06T23:32:38Z) - Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach.
Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z) - Parallel simulation for sampling under isoperimetry and score-based diffusion models [56.39904484784127]
As data size grows, reducing the iteration cost becomes an important goal. Inspired by the success of the parallel simulation of the initial value problem in scientific computation, we propose parallel Picard methods for sampling tasks. Our work highlights the potential advantages of simulation methods in scientific computation for dynamics-based sampling and diffusion models.
arXiv Detail & Related papers (2024-12-10T11:50:46Z) - Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation [53.17668583030862]
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation.
We propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP).
We show that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*))\sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and the log-covering number of the hypothesis class, respectively.
arXiv Detail & Related papers (2024-04-19T06:24:22Z) - Gaussian Mixture Solvers for Diffusion Models [84.83349474361204]
We introduce a novel class of SDE-based solvers called GMS for diffusion models.
Our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis.
arXiv Detail & Related papers (2023-11-02T02:05:38Z) - Learning Mixtures of Gaussians Using the DDPM Objective [11.086440815804226]
We prove that gradient descent on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground truth parameters of the mixture model.
A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning.
arXiv Detail & Related papers (2023-07-03T17:44:22Z) - Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood
Estimation for Latent Gaussian Models [69.22568644711113]
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
arXiv Detail & Related papers (2023-06-05T21:08:34Z) - Provable Multi-instance Deep AUC Maximization with Stochastic Pooling [39.46116380220933]
This paper considers a novel application of deep AUC maximization (DAM) to multi-instance learning (MIL).
In MIL, a single class label is assigned to a bag of instances (e.g., multiple 2D slices of a scan for a patient).
arXiv Detail & Related papers (2023-05-14T01:29:56Z) - Learning Gaussian Mixtures Using the Wasserstein-Fisher-Rao Gradient
Flow [12.455057637445174]
We propose a new algorithm to compute the nonparametric maximum likelihood estimator (NPMLE) in a Gaussian mixture model.
Our method is based on gradient descent over the space of probability measures equipped with the Wasserstein-Fisher-Rao geometry.
We conduct extensive numerical experiments to confirm the effectiveness of the proposed algorithm.
arXiv Detail & Related papers (2023-01-04T18:59:35Z) - Beyond EM Algorithm on Over-specified Two-Component Location-Scale
Gaussian Mixtures [29.26015093627193]
We develop the Exponential Location Update (ELU) algorithm to efficiently explore the curvature of the negative log-likelihood functions.
We demonstrate that the ELU algorithm converges to the final statistical radius of the models after a logarithmic number of iterations.
arXiv Detail & Related papers (2022-05-23T06:49:55Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - Inverting brain grey matter models with likelihood-free inference: a
tool for trustable cytoarchitecture measurements [62.997667081978825]
Characterisation of the brain grey matter cytoarchitecture with quantitative sensitivity to soma density and volume remains an unsolved challenge in dMRI.
We propose a new forward model, specifically a new system of equations, requiring a few relatively sparse b-shells.
We then apply modern tools from Bayesian analysis known as likelihood-free inference (LFI) to invert our proposed model.
arXiv Detail & Related papers (2021-11-15T09:08:27Z) - Clustering a Mixture of Gaussians with Unknown Covariance [4.821312633849745]
We derive a Max-Cut integer program based on maximum likelihood estimation.
We develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size.
We generalize the Max-Cut program to a $k$-means program that handles multi-component mixtures with possibly unequal weights.
arXiv Detail & Related papers (2021-10-04T17:59:20Z) - Mean-Square Analysis with An Application to Optimal Dimension Dependence
of Langevin Monte Carlo [60.785586069299356]
This work provides a general framework for the non-asymptotic analysis of sampling error in the 2-Wasserstein distance.
Our theoretical analysis is further validated by numerical experiments.
arXiv Detail & Related papers (2021-09-08T18:00:05Z) - Learning Gaussian Mixtures with Generalised Linear Models: Precise
Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exact asymptotics characterising the estimator in high dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z) - A Rigorous Link Between Self-Organizing Maps and Gaussian Mixture Models [78.6363825307044]
This work presents a mathematical treatment of the relation between Self-Organizing Maps (SOMs) and Gaussian Mixture Models (GMMs).
We show that energy-based SOM models can be interpreted as performing gradient descent.
This link allows SOMs to be treated as generative probabilistic models, giving a formal justification for using SOMs to detect outliers or for sampling.
arXiv Detail & Related papers (2020-09-24T14:09:04Z) - Self-regularizing Property of Nonparametric Maximum Likelihood Estimator
in Mixture Models [39.27013036481509]
We study the nonparametric maximum likelihood estimator (NPMLE) for general Gaussian mixtures.
We show that, with high probability, the NPMLE based on a sample of size $n$ has $O(\log n)$ atoms (mass points).
Notably, any mixture is statistically indistinguishable from a finite one with $O(\log n)$ components.
arXiv Detail & Related papers (2020-08-19T03:39:13Z) - Mean-Field Approximation to Gaussian-Softmax Integral with Application
to Uncertainty Estimation [23.38076756988258]
We propose a new single-model based approach to quantify uncertainty in deep neural networks.
We use a mean-field approximation formula to compute an analytically intractable integral.
Empirically, the proposed approach performs competitively when compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-06-13T07:32:38Z)
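Relating to the last entry above: a standard closed-form device for the Gaussian-softmax integral $\mathbb{E}[\mathrm{softmax}(z)]$ with $z \sim N(\mu, \mathrm{diag}(\sigma^2))$ is to shrink each logit by $1/\sqrt{1 + \lambda_0 \sigma_k^2}$ with $\lambda_0 = \pi/8$, in the spirit of the classical probit approximation. The sketch below uses this generic device for illustration only; it is not claimed to be the exact formula of the cited paper, and the names `mean_field_softmax` and `mc_softmax` are hypothetical.

```python
import numpy as np

LAMBDA0 = np.pi / 8.0  # constant from the probit-style approximation

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_field_softmax(mu, var):
    """Closed-form approximation of E[softmax(z)] for z ~ N(mu, diag(var))."""
    return softmax(mu / np.sqrt(1.0 + LAMBDA0 * var))

def mc_softmax(mu, var, n_samples=200_000, seed=0):
    """Monte Carlo reference for the same integral, used as a sanity check."""
    rng = np.random.default_rng(seed)
    z = mu + np.sqrt(var) * rng.standard_normal((n_samples, mu.shape[-1]))
    return softmax(z).mean(axis=0)

mu = np.array([2.0, 0.5, -1.0])
var = np.array([1.0, 4.0, 0.25])
print(mean_field_softmax(mu, var))  # analytic approximation
print(mc_softmax(mu, var))          # sampling-based estimate
```

The appeal of such an approximation is that it yields predictive class probabilities in a single forward pass, avoiding Monte Carlo sampling at prediction time.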