Quantitative Convergence of Wasserstein Gradient Flows of Kernel Mean Discrepancies
- URL: http://arxiv.org/abs/2603.01977v1
- Date: Mon, 02 Mar 2026 15:32:54 GMT
- Title: Quantitative Convergence of Wasserstein Gradient Flows of Kernel Mean Discrepancies
- Authors: Lénaïc Chizat, Maria Colombo, Roberto Colombo, Xavier Fernández-Real,
- Abstract summary: We study the quantitative convergence of Wasserstein gradient flows of Kernel Mean Discrepancy functionals. Our setting covers in particular the training dynamics of shallow neural networks in the infinite-width and continuous time limit.
- Score: 10.511277414974613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the quantitative convergence of Wasserstein gradient flows of Kernel Mean Discrepancy (KMD) (also known as Maximum Mean Discrepancy (MMD)) functionals. Our setting covers in particular the training dynamics of shallow neural networks in the infinite-width and continuous time limit, as well as interacting particle systems with pairwise Riesz kernel interaction in the mean-field and overdamped limit. Our main analysis concerns the model case of KMD functionals given by the squared Sobolev distance $\mathscr{E}^{\nu}_{s}(\mu) = \frac{1}{2}\lVert \mu-\nu\rVert_{\dot H^{-s}}^{2}$ for any $s\geq 1$ and $\nu$ a fixed probability measure on the $d$-dimensional torus. First, inspired by Yudovich theory for the $2d$-Euler equation, we establish existence and uniqueness in natural weak regularity classes. Next, we show that for $s=1$ the flow converges globally at an exponential rate under minimal assumptions, while for $s>1$ we prove local convergence at polynomial rates that depend explicitly on $s$ and on the Sobolev regularity of $\mu$ and $\nu$. These rates hold both at the energy level and in higher regularity classes, and are tight for $\nu$ uniform. We then consider the gradient flow of the population loss for shallow neural networks with ReLU activation, which can be cast as a Wasserstein–Fisher–Rao gradient flow on the space of nonnegative measures on the sphere $\mathbb{S}^d$. Exploiting a correspondence with the Sobolev energy case with $s=(d+3)/2$, we derive an explicit polynomial local convergence rate for this dynamics. Except for the special case $s=1$, even non-quantitative convergence was previously open in all these settings. We also include numerical experiments in dimension $d=1$ using both PDE and particle methods, which illustrate our analysis.
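A minimal particle sketch of this dynamics in dimension $d=1$: on the torus, $\lVert \mu-\nu\rVert_{\dot H^{-s}}^{2} = \sum_{k\neq 0} |k|^{-2s}|\hat\mu(k)-\hat\nu(k)|^{2}$ up to normalization conventions, so $\mathscr{E}^{\nu}_{s}$ is an MMD energy whose kernel has Fourier coefficients $|k|^{-2s}$. The kernel truncation, particle counts, target, and time step below are assumptions for illustration, not the authors' setup.

```python
import numpy as np

# Illustrative sketch: particle discretization of the Wasserstein gradient flow
# of the Sobolev/MMD energy on the 1-d torus, with the truncated kernel
#   K(x) = 2 * sum_{k=1..kmax} k^{-2s} cos(k x),
# whose Fourier coefficients k^{-2s} reproduce the H^{-s} norm up to
# normalization.  s, kmax, particle counts, and the time step are assumptions.
s, kmax = 1.0, 32
ks = np.arange(1, kmax + 1)
w = ks ** (-2.0 * s)                         # Fourier weights k^{-2s}

def grad_K(D):
    """K'(d) = -2 * sum_k k^{-2s} * k * sin(k * d), for an array of differences D."""
    return -2.0 * (np.sin(D[..., None] * ks) @ (w * ks))

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, 100)               # particles carrying mu
Y = rng.vonmises(np.pi, 2.0, 500) % (2 * np.pi)  # samples from a target nu

eta, steps = 0.5, 200                            # time step and horizon (assumptions)
for _ in range(steps):
    # velocity field of the flow: v = -K' * (mu - nu), evaluated at each particle
    v = grad_K(X[:, None] - Y).mean(1) - grad_K(X[:, None] - X).mean(1)
    X = (X + eta * v) % (2 * np.pi)
```

Each particle moves with the velocity field $-K' * (\mu-\nu)$, which is the Wasserstein gradient flow velocity for this energy.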
Related papers
- Generalization Properties of Score-matching Diffusion Models for Intrinsically Low-dimensional Data [32.72306410557258]
We study the statistical convergence of score-based diffusion models for learning an unknown distribution from finitely many samples. Our results demonstrate that diffusion models naturally adapt to the intrinsic geometry of data. Our theory conceptually bridges the analysis of diffusion models with that of GANs and the sharp minimax rates established in optimal transport.
arXiv Detail & Related papers (2026-03-04T03:59:02Z) - Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation [2.815765641180636]
We show that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights can approximate any function in the Sobolev space. We also prove that our construction is nearly optimal by showing that the required number of parameters matches the corresponding lower bound up to a logarithmic factor.
arXiv Detail & Related papers (2025-09-11T11:28:20Z) - Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights [15.424946932398713]
We study the Finite-Dimensional Distributions (FDDs) of deep neural networks with random weights that have finite-order moments. We establish Gaussian approximation bounds in the Wasserstein-$1$ norm between the FDDs and their Gaussian limit. In the special case where all widths are proportional to a common scale parameter $n$ and there are $L-1$ hidden layers, we obtain convergence rates of order $n^{-(1/6)^{L-1} + \epsilon}$, for any $\epsilon > 0$.
arXiv Detail & Related papers (2025-07-16T23:41:09Z) - Neural Sampling from Boltzmann Densities: Fisher-Rao Curves in the Wasserstein Geometry [1.609940380983903]
We deal with the task of sampling from an unnormalized Boltzmann density $\rho_D$ by learning a Boltzmann curve given by $f_t$.
Inspired by Máté and Fleuret, we propose an approach which parametrizes only $f_t$ and fixes an appropriate $v_t$.
This corresponds to the Wasserstein flow of the Kullback-Leibler divergence related to Langevin dynamics.
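For orientation, the density curve and velocity field in such schemes are coupled through the continuity equation; a generic form (with notation assumed for illustration rather than taken from the paper) is

```latex
\rho_t = \frac{e^{-f_t}}{Z_t}, \qquad \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0,
```

where $Z_t$ is the normalizing constant; the standard choice $v_t = -\nabla \log(\rho_t/\rho_D)$ recovers the Wasserstein gradient flow of the KL divergence mentioned above.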
arXiv Detail & Related papers (2024-10-04T09:54:11Z) - Improved Finite-Particle Convergence Rates for Stein Variational Gradient Descent [14.890609936348277]
We provide finite-particle convergence rates for the Stein Variational Gradient Descent algorithm in the Kernelized Stein Discrepancy ($\mathsf{KSD}$) and Wasserstein-2 metrics. Our key insight is that the time derivative of the relative entropy between the joint density of the $N$ particle locations and the $N$-fold product of the target measure splits into a dominant 'negative part' proportional to $N$ times the expected $\mathsf{KSD}^2$ and a smaller 'positive part'.
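For reference, the SVGD update analyzed here is the standard one of Liu and Wang (2016); the RBF kernel with fixed bandwidth and the Gaussian target in this sketch are illustrative assumptions:

```python
import numpy as np

# Standard SVGD step:
#   phi(x_i) = (1/N) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
# RBF kernel, fixed bandwidth h, and Gaussian target are illustrative assumptions.
def svgd_step(X, grad_log_p, h=1.0, eps=0.1):
    diff = X[:, None, :] - X[None, :, :]        # diff[i, j] = x_i - x_j, shape (N, N, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))  # kernel matrix k(x_i, x_j)
    # grad_{x_j} k(x_j, x_i) = (x_i - x_j) k(x_i, x_j) / h, summed over j for each i
    repulsion = (diff * K[..., None] / h).sum(axis=1)
    phi = (K @ grad_log_p(X) + repulsion) / X.shape[0]
    return X + eps * phi

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + 3.0             # particles, deliberately offset
for _ in range(500):
    X = svgd_step(X, lambda x: -x)              # target: standard Gaussian
```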
arXiv Detail & Related papers (2024-09-13T01:49:19Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
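As a point of reference, the 1-bit observation model named above is the canonical one; the toy generative prior, measurement matrix, and dimensions in this sketch are assumptions:

```python
import numpy as np

# Canonical 1-bit generative compressed sensing observations: y = sign(A x*),
# with x* constrained to the range of a generative model G.  The toy G, the
# Gaussian measurement matrix A, and all dimensions are assumptions.
rng = np.random.default_rng(0)
n, m, k = 100, 400, 10                  # ambient dim, measurements, latent dim
W = rng.standard_normal((n, k))
G = lambda z: np.tanh(W @ z)            # hypothetical generative prior
x_star = G(rng.standard_normal(k))      # signal in the range of G
A = rng.standard_normal((m, n))         # measurement matrix
y = np.sign(A @ x_star)                 # 1-bit (sign) observations
# a uniform guarantee must hold for every x* in the range of G simultaneously
```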
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Projected Langevin dynamics and a gradient flow for entropic optimal
transport [0.8057006406834466]
We introduce analogous diffusion dynamics that sample from an entropy-regularized optimal transport coupling.
By studying the induced Wasserstein geometry of the submanifold $\Pi(\mu,\nu)$, we argue that the SDE can be viewed as a Wasserstein gradient flow on this space of couplings.
arXiv Detail & Related papers (2023-09-15T17:55:56Z) - Theory of free fermions under random projective measurements [43.04146484262759]
We develop an analytical approach to the study of one-dimensional free fermions subject to random projective measurements of local site occupation numbers.
We derive a non-linear sigma model (NLSM) as an effective field theory of the problem.
arXiv Detail & Related papers (2023-04-06T15:19:33Z) - Over-Parameterization Exponentially Slows Down Gradient Descent for
Learning a Single Neuron [49.45105570960104]
We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate.
These two bounds jointly give an exact characterization of the convergence rate.
We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
arXiv Detail & Related papers (2023-02-20T15:33:26Z) - On parametric resonance in the laser action [91.3755431537592]
We consider the self-consistent semiclassical Maxwell–Schrödinger system for the solid-state laser.
We introduce the corresponding Poincaré map $P$ and consider the differential $DP(Y_0)$ at a suitable stationary state $Y_0$.
arXiv Detail & Related papers (2022-08-22T09:43:57Z) - Convergence of Langevin Monte Carlo in Chi-Squared and Rényi Divergence [8.873449722727026]
For convex and first-order smooth potentials, we show that the LMC algorithm achieves the rate estimate $\widetilde{\mathcal{O}}(d\epsilon^{-1})$, which improves upon the previously known rates in both of these metrics.
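For reference, the LMC iteration in question is the unadjusted Langevin algorithm; the quadratic potential and step size in this sketch are illustrative assumptions:

```python
import numpy as np

# Unadjusted Langevin Monte Carlo:
#   x_{k+1} = x_k - eta * grad f(x_k) + sqrt(2 * eta) * xi_k,  xi_k ~ N(0, I),
# targeting pi proportional to exp(-f).  The convex, smooth potential
# f(x) = |x|^2 / 2 and the step size eta are illustrative assumptions.
rng = np.random.default_rng(0)
d, eta, n_iter = 10, 1e-2, 20_000
grad_f = lambda x: x                    # gradient of f(x) = |x|^2 / 2
x = np.zeros(d)
for _ in range(n_iter):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(d)
# after burn-in, the iterates are approximate samples from N(0, I)
```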
arXiv Detail & Related papers (2020-07-22T18:18:28Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
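A minimal sketch of the asynchronous setting, in which the Q-table is updated along a single Markovian trajectory; the toy MDP, behavior policy, and learning rate here are assumptions:

```python
import numpy as np

# Asynchronous Q-learning: one (state, action) entry is updated per step,
# following a single Markovian trajectory (no generative-model resets).
# The random toy MDP, uniform behavior policy, and learning rate are assumptions.
rng = np.random.default_rng(0)
nS, nA, gamma, eta = 5, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P[s, a] -> dist over s'
R = rng.uniform(size=(nS, nA))                 # deterministic rewards r(s, a)
Q = np.zeros((nS, nA))
s = 0
for _ in range(100_000):
    a = rng.integers(nA)                       # uniform behavior policy
    s_next = rng.choice(nS, p=P[s, a])
    td_target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += eta * (td_target - Q[s, a])     # asynchronous TD update
    s = s_next
```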
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Anisotropy-mediated reentrant localization [62.997667081978825]
We consider a 2d dipolar system, $d=2$, with the generalized dipole-dipole interaction $\sim r^{-a}$, and the power $a$ controlled experimentally in trapped-ion or Rydberg-atom systems.
We show that the spatially homogeneous tilt $\beta$ of the dipoles giving rise to the anisotropic dipole exchange leads to the non-trivial reentrant localization beyond the locator expansion.
arXiv Detail & Related papers (2020-01-31T19:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.