Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference
- URL: http://arxiv.org/abs/2505.21721v2
- Date: Tue, 21 Oct 2025 00:38:53 GMT
- Title: Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference
- Authors: Kyurae Kim, Yi-An Ma, Trevor Campbell, Jacob R. Gardner
- Abstract summary: We prove that black-box variational inference converges at a rate that is nearly independent of explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.
- Score: 40.00392382788829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at a rate that is nearly independent of explicit dimension dependence. Specifically, for a $d$-dimensional strongly log-concave and log-smooth target, the number of iterations for BBVI with a sub-Gaussian family to obtain a solution $\epsilon$-close to the global optimum has a dimension dependence of $\mathrm{O}(\log d)$. This is a significant improvement over the $\mathrm{O}(d)$ dependence of full-rank location-scale families. For heavy-tailed families, we prove a weaker $\mathrm{O}(d^{2/k})$ dependence, where $k$ is the number of finite moments of the family. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.
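The following is a minimal, illustrative sketch of the setting the abstract describes: mean-field location-scale BBVI optimized with single-sample reparametrization gradients. To keep it self-contained, it assumes a diagonal-Gaussian target (strongly log-concave and log-smooth) so that the score is available in closed form, and uses plain SGD with a hand-picked step size and a log-scale parameterization; these choices are assumptions for illustration and may differ from the parameterization and stepping scheme analyzed in the paper.

```python
import numpy as np

# Illustrative sketch of mean-field location-scale BBVI with the
# reparametrization gradient. Target, step size, iteration count, and
# parameterization are assumptions, not taken from the paper.

rng = np.random.default_rng(0)
d = 100

# Strongly log-concave, log-smooth target: N(mu_star, diag(prec)^-1),
# so grad log pi(x) = -prec * (x - mu_star) in closed form.
mu_star = rng.normal(size=d)
prec = rng.uniform(1.0, 4.0, size=d)  # diagonal precision

def grad_log_target(x):
    return -prec * (x - mu_star)

# Mean-field location-scale family: z = m + exp(log_s) * u, u ~ N(0, I).
m = np.zeros(d)
log_s = np.zeros(d)

step = 1e-2
for t in range(2000):
    u = rng.normal(size=d)          # base noise (sub-Gaussian family)
    s = np.exp(log_s)
    z = m + s * u                   # reparametrized sample
    g = grad_log_target(z)
    # Single-sample reparametrization gradient of the ELBO:
    #   d/dm  E[log pi(z)]              = E[grad log pi(z)]
    #   d/ds  E[log pi(z)] + entropy    = E[u * grad log pi(z)] + 1/s
    grad_m = g
    grad_log_s = (u * g + 1.0 / s) * s  # chain rule for the log-scale parameterization
    m += step * grad_m
    log_s += step * grad_log_s

# With a diagonal Gaussian target, the mean-field optimum matches the target
# exactly, so both errors should be small (up to SGD noise).
print("max |m - mu_star|      :", np.max(np.abs(m - mu_star)))
print("max |s - prec^{-1/2}|  :", np.max(np.abs(np.exp(log_s) - prec ** -0.5)))
```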
Related papers
- Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence [50.719298242863744]
Underdamped Langevin dynamics (ULD) is a widely-used sampler for Gibbs distributions $\propto e^{-V}$. We prove the first dimension-free KL divergence bounds for discretized ULD.
arXiv Detail & Related papers (2026-03-02T22:14:38Z) - A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data [32.98264375121064]
We consider supervised deep learning when the given explanatory variable is distributed according to an exponential family. Under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}\, n^{-\frac{2\beta}{2\beta+d}}\right)$.
arXiv Detail & Related papers (2024-12-13T01:15:17Z) - Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias [13.642712817536072]
We show that as the dimension $d$ of the problem increases, the number of iterations required to ensure convergence within a desired error increases.
A key technical challenge we address is the lack of a one-step contraction property in the $W_{2,\ell^\infty}$ metric to measure convergence.
arXiv Detail & Related papers (2024-08-20T01:24:54Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Provably Scalable Black-Box Variational Inference with Structured Variational Families [8.344690980938307]
We show that certain scale matrix structures can achieve a better iteration complexity of $\mathcal{O}(N)$, implying better scaling with respect to $N$. We empirically verify our theoretical results on large-scale hierarchical models.
arXiv Detail & Related papers (2024-01-19T19:04:23Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped stochastic optimization problems with heavy-tailed noise having structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Curvature-Independent Last-Iterate Convergence for Games on Riemannian
Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
arXiv Detail & Related papers (2023-06-29T01:20:44Z) - Convergence and concentration properties of constant step-size SGD
through Markov chains [0.0]
We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD).
We show that, for unbiased gradient estimates with mildly controlled variance, the iterates converge to an invariant distribution in total variation distance.
All our results are non-asymptotic and their consequences are discussed through a few applications.
arXiv Detail & Related papers (2023-06-20T12:36:28Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - Quantum state tomography via non-convex Riemannian gradient descent [13.100184125419691]
In this work, we derive a quantum state tomography scheme that improves the dependence on the condition number $\kappa$.
The improvement comes from applying non-convex Riemannian gradient descent (RGD) algorithms.
Our theoretical results of extremely fast convergence and nearly optimal error bounds are corroborated by numerical results.
arXiv Detail & Related papers (2022-10-10T14:19:23Z) - High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad
Stepsize [55.0090961425708]
We propose a new, simplified high probability analysis of AdaGrad for smooth, non-convex problems.
We present our analysis in a modular way and obtain a complementary $\mathcal{O}(1/T)$ convergence rate in the deterministic setting.
To the best of our knowledge, this is the first high probability result for AdaGrad with a truly adaptive scheme, i.e., completely oblivious to the knowledge of smoothness.
arXiv Detail & Related papers (2022-04-06T13:50:33Z) - Convergence of Gaussian-smoothed optimal transport distance with
sub-gamma distributions and dependent samples [12.77426855794452]
This paper provides convergence guarantees for estimating the GOT distance under more general settings.
A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy distances.
arXiv Detail & Related papers (2021-02-28T04:30:23Z) - Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for both Adam and Adagrad, with a rate of $O(d\ln(N)/\sqrt{N})$ for Adagrad.
Adam converges with the same $O(d\ln(N)/\sqrt{N})$ rate when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.