Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian
- URL: http://arxiv.org/abs/2511.22069v1
- Date: Thu, 27 Nov 2025 03:41:48 GMT
- Title: Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian
- Authors: Yiran Zhang, Weihang Xu, Mo Zhou, Maryam Fazel, Simon Shaolei Du
- Abstract summary: We study gradient descent for training models to learn a single Gaussian distribution. We prove a global convergence result for gradient descent under multiple regimes. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.
- Score: 48.340460104014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.
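To make the setup concrete, below is a minimal sketch (not the paper's exact construction) of population score matching with gradient descent for an over-parameterized student fitting a single Gaussian. It assumes a 1-D ground-truth target $N(\mu^*, 1)$ perturbed by Gaussian noise of scale $\sigma$ (giving a noised target $N(\mu^*, 1+\sigma^2)$), an equal-weight Gaussian-mixture student whose $n$ component means are the learnable parameters, a Monte Carlo approximation of the population objective, and finite-difference gradients; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma = 0.0, 1.0          # ground-truth mean and noise scale (assumed values)
n = 8                              # number of learnable parameters (component means)
var = 1.0 + sigma ** 2             # variance of the noise-perturbed data distribution

def student_score(x, mus):
    """Score of an equal-weight Gaussian mixture with means `mus` and variance `var`."""
    d = x[:, None] - mus[None, :]                      # (samples, components)
    logw = -0.5 * d ** 2 / var                         # unnormalized log responsibilities
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # posterior responsibilities
    return (w * (-d / var)).sum(axis=1)                # sum_i w_i(x) * (mu_i - x) / var

# Fixed sample pool approximating the population expectation over the noised target.
x_pool = rng.normal(mu_star, np.sqrt(var), size=20_000)

def loss(mus, x=x_pool):
    """Monte Carlo estimate of E_x || s_student(x) - grad log p_sigma(x) ||^2."""
    target_score = (mu_star - x) / var                 # score of N(mu_star, var)
    return np.mean((student_score(x, mus) - target_score) ** 2)

def grad(mus, eps=1e-4):
    """Central finite-difference gradient of the loss (autodiff would also work)."""
    g = np.zeros_like(mus)
    for i in range(len(mus)):
        e = np.zeros_like(mus)
        e[i] = eps
        g[i] = (loss(mus + e) - loss(mus - e)) / (2 * eps)
    return g

mus = rng.normal(5.0, 1.0, size=n)   # random initialization far from the ground truth
lr = 0.2
for t in range(300):
    mus -= lr * grad(mus)

print("final means:", np.round(mus, 3))
print("final loss :", loss(mus))
```

Under a random initialization far from the ground truth, one would expect behavior qualitatively like the regime described in the abstract (the loss decays even if not all component means reach $\mu^*$), but this sketch is only meant to illustrate the objective, not to reproduce the paper's results.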
Related papers
- RANSAC Scoring Functions: Analysis and Reality Check [0.0]
We revisit the problem of assigning a score (a quality of fit) to candidate geometric models. We show that a threshold-based parameterization leads to a unified view of likelihood-based and robust M-estimators.
arXiv Detail & Related papers (2025-12-22T20:08:46Z) - Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent [1.7188280334580195]
We revisit Toeplitz covariance estimation through the lens of simple gradient descent (GD). We show that when $K = P$, GD may converge to suboptimal solutions. We propose an accelerated GD variant with separate learning rates for amplitudes and frequencies.
arXiv Detail & Related papers (2025-11-03T14:07:53Z) - Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - On the Wasserstein Convergence and Straightness of Rectified Flow [54.580605276017096]
Rectified Flow (RF) is a generative model that aims to learn straight flow trajectories from noise to data. We provide a theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution. We present general conditions guaranteeing uniqueness and straightness of 1-RF, which is in line with previous empirical findings.
arXiv Detail & Related papers (2024-10-19T02:36:11Z) - Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling [15.55478112106342]
We consider the problem of approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We propose an efficient procedure which exploits a small i.i.d. random subset of $m \ll n$ samples drawn either uniformly or using approximate leverage scores from the initial observations.
arXiv Detail & Related papers (2023-11-22T17:44:18Z) - Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise.
In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models.
arXiv Detail & Related papers (2023-06-13T01:18:16Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - Gradient flows and randomised thresholding: sparse inversion and classification [0.0]
Sparse inversion and classification problems are ubiquitous in modern data science and imaging.
In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy.
Standard (sub)gradient descent methods have been shown to be inefficient when approaching such problems.
arXiv Detail & Related papers (2022-03-22T09:21:14Z) - Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z) - Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting and Regularization [39.35822033674126]
We study binary linear classification under a generative Gaussian mixture model.
We derive novel non-asymptotic bounds on the classification error.
Our results extend to a noisy model with constant probability noise flips.
arXiv Detail & Related papers (2020-11-18T07:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.