Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixtures
- URL: http://arxiv.org/abs/2506.06584v1
- Date: Fri, 06 Jun 2025 23:32:38 GMT
- Title: Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixtures
- Authors: Mo Zhou, Weihang Xu, Maryam Fazel, Simon S. Du
- Abstract summary: We study the dynamics of gradient EM and employ tensor decomposition to characterize the geometric landscape of the likelihood loss. This is the first global convergence and recovery result for EM or gradient EM beyond the special case of $m=2$.
- Score: 53.51230405648361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning Gaussian Mixture Models (GMMs) is a fundamental problem in machine learning, with the Expectation-Maximization (EM) algorithm and its popular variant gradient EM being arguably the most widely used algorithms in practice. In the exact-parameterized setting, where both the ground truth GMM and the learning model have the same number of components $m$, a vast line of work has aimed to establish rigorous recovery guarantees for EM. However, global convergence has only been proven for the case of $m=2$, and EM is known to fail to recover the ground truth when $m\geq 3$. In this paper, we consider the $\textit{over-parameterized}$ setting, where the learning model uses $n>m$ components to fit an $m$-component ground truth GMM. In contrast to the exact-parameterized case, we provide a rigorous global convergence guarantee for gradient EM. Specifically, for any well separated GMMs in general position, we prove that with only mild over-parameterization $n = \Omega(m\log m)$, randomly initialized gradient EM converges globally to the ground truth at a polynomial rate with polynomial samples. Our analysis proceeds in two stages and introduces a suite of novel tools for Gaussian Mixture analysis. We use Hermite polynomials to study the dynamics of gradient EM and employ tensor decomposition to characterize the geometric landscape of the likelihood loss. This is the first global convergence and recovery result for EM or Gradient EM beyond the special case of $m=2$.
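To make the object of study concrete, the following is a minimal sketch of one gradient EM iteration for a simplified model: a spherical GMM with uniform mixing weights and identity covariances, where only the means are learned. The function name, learning rate, and this simplified setup are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gradient_em_step(X, mu, lr=1.0):
    """One gradient EM update of the component means.

    X:  (N, d) data samples.
    mu: (n, d) current means; n may exceed the true number of
        components m (the over-parameterized setting).
    """
    # E-step: posterior responsibilities w[i, j] proportional to
    # exp(-||x_i - mu_j||^2 / 2) under uniform weights, identity covariance.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, n)
    logw = -0.5 * d2
    logw -= logw.max(axis=1, keepdims=True)  # stabilize the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)

    # Gradient step on the log-likelihood with respect to the means:
    # grad_j = (1/N) * sum_i w[i, j] * (x_i - mu_j)
    grad = (w[:, :, None] * (X[:, None, :] - mu[None, :, :])).mean(axis=0)
    return mu + lr * grad
```

Run from a random initialization with $n > m$ components, the dominant components migrate toward the true centers, while redundant components can receive vanishing responsibility; the paper's analysis addresses why mild over-parameterization still yields global recovery.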
Related papers
- Learning large softmax mixtures with warm start EM [17.081578976570437]
Softmax mixture models (SMMs) are discrete $K$-mixtures introduced to model the probability of choosing an $x_j \in \mathbb{R}^L$ from $p$ candidates. This paper provides a comprehensive analysis of the EM algorithm for SMMs in high dimensions.
arXiv Detail & Related papers (2024-09-16T00:14:48Z)
- Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models [47.294535652946095]
We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMMs). This is the first global convergence result for Gaussian mixtures with more than $2$ components.
arXiv Detail & Related papers (2024-06-29T16:44:29Z)
- MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving growing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but they are limited by the standard $L$-smoothness or bounded-gradient assumptions. We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exact asymptotics characterising the estimator in high dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z)
- Improved Convergence Guarantees for Learning Gaussian Mixture Models by EM and Gradient EM [15.251738476719918]
We consider the problem of estimating the parameters of a Gaussian Mixture Model with $K$ components of known weights.
We present a sharper analysis of the local convergence of EM and gradient EM, compared to previous works.
Our second contribution is improved sample size requirements for accurate estimation by EM and gradient EM.
arXiv Detail & Related papers (2021-01-03T08:10:01Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians [21.780375994511324]
We prove a new result for the Expectation-Maximization (EM) algorithm: we show that EM converges locally under separation $\Omega(\sqrt{\log k})$.
Our results do not assume or use prior knowledge of the (potentially different) Gaussian components.
arXiv Detail & Related papers (2020-02-02T05:09:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.