On the Convergence of Langevin Monte Carlo: The Interplay between Tail
Growth and Smoothness
- URL: http://arxiv.org/abs/2005.13097v1
- Date: Wed, 27 May 2020 00:26:20 GMT
- Title: On the Convergence of Langevin Monte Carlo: The Interplay between Tail
Growth and Smoothness
- Authors: Murat A. Erdogdu, Rasa Hosseinzadeh
- Abstract summary: We show that for potentials with Lipschitz gradient, i.e. $\beta=1$, our rate recovers the best known rate in terms of $\epsilon$ dependency.
Our results apply to sampling from a target distribution $\nu_* = e^{-f}$, with convergence measured in KL-divergence.
- Score: 10.482805367361818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study sampling from a target distribution ${\nu_* = e^{-f}}$ using the
unadjusted Langevin Monte Carlo (LMC) algorithm. For any potential function $f$
whose tails behave like ${\|x\|^\alpha}$ for ${\alpha \in [1,2]}$, and has
$\beta$-H\"older continuous gradient, we prove that ${\widetilde{\mathcal{O}}
\Big(d^{\frac{1}{\beta}+\frac{1+\beta}{\beta}(\frac{2}{\alpha} -
\boldsymbol{1}_{\{\alpha \neq 1\}})} \epsilon^{-\frac{1}{\beta}}\Big)}$ steps
are sufficient to reach the $\epsilon $-neighborhood of a $d$-dimensional
target distribution $\nu_*$ in KL-divergence. This convergence rate, in terms
of $\epsilon$ dependency, is not directly influenced by the tail growth rate
$\alpha$ of the potential function as long as its growth is at least linear,
and it only relies on the order of smoothness $\beta$. One notable consequence
of this result is that for potentials with Lipschitz gradient, i.e. $\beta=1$,
our rate recovers the best known rate
${\widetilde{\mathcal{O}}(d\epsilon^{-1})}$ which was established for strongly
convex potentials in terms of $\epsilon$ dependency, but we show that the same
rate is achievable for a wider class of potentials that are degenerately convex
at infinity. The growth rate $\alpha$ starts to have an effect on the
established rate in high dimensions where $d$ is large; furthermore, it
recovers the best-known dimension dependency when the tail growth of the
potential is quadratic, i.e. ${\alpha = 2}$, in the current setup. Our
framework allows for finite perturbations, and any order of smoothness
${\beta\in(0,1]}$; consequently, our results are applicable to a wide class of
non-convex potentials that are weakly smooth and exhibit at least linear tail
growth.
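As a sanity check on the stated rate, plugging $\beta = 1$ and $\alpha = 2$ into the exponent of $d$ gives $\frac{1}{\beta}+\frac{1+\beta}{\beta}\big(\frac{2}{\alpha}-\boldsymbol{1}_{\{\alpha \neq 1\}}\big) = 1 + 2\,(1-1) = 1$, while $\epsilon^{-1/\beta} = \epsilon^{-1}$, recovering the quoted $\widetilde{\mathcal{O}}(d\epsilon^{-1})$ rate.
The sampler analyzed in the abstract is the unadjusted Langevin iteration $x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I_d)$. The snippet below is a minimal sketch of that update for readers who want to experiment; the Gaussian target, step size, chain length, and function names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def lmc_sample(grad_f, x0, step_size, n_steps, rng=None):
    """One chain of unadjusted Langevin Monte Carlo:
    x_{k+1} = x_k - h * grad_f(x_k) + sqrt(2h) * N(0, I_d)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step_size * grad_f(x) + np.sqrt(2.0 * step_size) * noise
    return x

# Illustrative target: standard Gaussian nu_* = e^{-f} with f(x) = ||x||^2 / 2,
# i.e. quadratic tail growth (alpha = 2) and Lipschitz gradient (beta = 1).
if __name__ == "__main__":
    d = 10
    grad_f = lambda x: x  # gradient of f(x) = ||x||^2 / 2
    samples = np.stack([
        lmc_sample(grad_f, np.zeros(d), step_size=1e-2, n_steps=2000)
        for _ in range(500)
    ])
    print("empirical mean norm:", np.linalg.norm(samples.mean(axis=0)))
    print("empirical per-coordinate variance (target = 1):", samples.var(axis=0).mean())
```

With a fixed step size the unadjusted chain is asymptotically biased, which is why the guarantee is stated as reaching an $\epsilon$-neighborhood of $\nu_*$ in KL rather than exact convergence; the theorem quantifies how many steps suffice as a function of $d$, $\epsilon$, $\alpha$, and $\beta$.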
Related papers
- On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes an $O(\frac{\sqrt{d}}{T^{1/4}})$ convergence rate for the averaged $\ell_1$ gradient norm $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\left[\|\nabla f(x_k)\|_1\right]$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered analogous to the standard $\frac{1}{T}\sum_{k=1}^{T}$-averaged gradient-norm rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z)
- A qualitative difference between gradient flows of convex functions in
finite- and infinite-dimensional Hilbert spaces [2.7195102129095003]
We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions.
In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$.
arXiv Detail & Related papers (2023-10-26T17:33:52Z)
- Estimation and Inference in Distributional Reinforcement Learning [28.253677740976197]
We show that a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2 (1-\gamma)^4}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat{\eta}^\pi$ and $\eta^\pi$ are below $\epsilon$ with high probability.
Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $etapi$.
arXiv Detail & Related papers (2023-09-29T14:14:53Z)
- A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z)
- High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z)
- A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound of $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z)
- Convergence Rate of the (1+1)-Evolution Strategy with Success-Based
Step-Size Adaptation on Convex Quadratic Functions [20.666734673282498]
The (1+1)-evolution strategy (ES) with success-based step-size adaptation is analyzed on a general convex quadratic function.
The convergence rate of the (1+1)-ES is derived explicitly and rigorously on a general convex quadratic function.
arXiv Detail & Related papers (2021-03-02T09:03:44Z)
- Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n$, $d$, and $\delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z)
- Convergence of Langevin Monte Carlo in Chi-Squared and Renyi Divergence [8.873449722727026]
In particular, for convex and first-order smooth potentials, we show that the LMC algorithm achieves the rate estimate $\widetilde{\mathcal{O}}(d\epsilon^{-1})$, which improves the previously known rates in both of these metrics.
arXiv Detail & Related papers (2020-07-22T18:18:28Z)
- Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)