Convergence Rates of Stochastic Zeroth-order Gradient Descent for Łojasiewicz Functions
- URL: http://arxiv.org/abs/2210.16997v6
- Date: Wed, 19 Apr 2023 12:20:47 GMT
- Title: Convergence Rates of Stochastic Zeroth-order Gradient Descent for Łojasiewicz Functions
- Authors: Tianyu Wang and Yasong Feng
- Abstract summary: We prove convergence rates of Stochastic Zeroth-order Gradient Descent (SZGD) algorithms for Łojasiewicz functions.
Our results show that $\{ f(\mathbf{x}_t) - f(\mathbf{x}_\infty) \}_{t \in \mathbb{N}}$ can converge faster than $\{ \| \mathbf{x}_t - \mathbf{x}_\infty \| \}_{t \in \mathbb{N}}$.
- Score: 6.137707924685666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove convergence rates of Stochastic Zeroth-order Gradient Descent (SZGD)
algorithms for Łojasiewicz functions. The SZGD algorithm iterates as
\begin{align*}
\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t \widehat{\nabla} f (\mathbf{x}_t),
\qquad t = 0,1,2,3,\cdots , \end{align*} where $f$ is the objective function
that satisfies the Łojasiewicz inequality with Łojasiewicz exponent
$\theta$, $\eta_t$ is the step size (learning rate), and $ \widehat{\nabla} f
(\mathbf{x}_t) $ is the approximate gradient estimated using zeroth-order
information only.
Our results show that $ \{ f (\mathbf{x}_t) - f (\mathbf{x}_\infty) \}_{t \in
\mathbb{N} } $ can converge faster than $ \{ \| \mathbf{x}_t -
\mathbf{x}_\infty \| \}_{t \in \mathbb{N} }$, regardless of whether the
objective $f$ is smooth or nonsmooth.
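To make the iteration concrete, here is a minimal Python sketch of an SZGD-style loop. It uses a generic two-point random-direction finite-difference estimator for $\widehat{\nabla} f(\mathbf{x}_t)$; the estimator, the smoothing radius `mu`, the sample count, the step-size schedule, and the toy objective are illustrative assumptions, not the paper's exact construction. For reference, in one standard smooth formulation the Łojasiewicz inequality with exponent $\theta$ reads $|f(\mathbf{x}) - f(\mathbf{x}_\infty)|^{\theta} \leq c \, \| \nabla f(\mathbf{x}) \|$ near the limit point $\mathbf{x}_\infty$.

```python
import numpy as np


def zeroth_order_gradient(f, x, mu=1e-4, num_samples=10, rng=None):
    """Two-point random-direction estimate of the gradient of f at x,
    using only function evaluations (zeroth-order information).
    This is a generic estimator given for illustration; the paper's
    estimator and parameter choices may differ."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)  # random search direction
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / num_samples


def szgd(f, x0, step_size, num_iters=500):
    """SZGD-style loop: x_{t+1} = x_t - eta_t * g_t, where g_t is the
    zeroth-order gradient estimate above and eta_t = step_size(t)."""
    x = np.array(x0, dtype=float)
    for t in range(num_iters):
        g = zeroth_order_gradient(f, x)
        x = x - step_size(t) * g
    return x


if __name__ == "__main__":
    # Toy objective f(x) = ||x||^4; it satisfies a Lojasiewicz-type
    # inequality around its minimizer at the origin.
    f = lambda x: float(np.sum(x ** 2) ** 2)
    x_final = szgd(f, x0=np.ones(5), step_size=lambda t: 1e-2)
    print(f(x_final))  # should be close to 0
```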
Related papers
- Learning a Single Neuron Robustly to Distributional Shifts and Adversarial Label Noise [38.551072383777594]
We study the problem of learning a single neuron with respect to the $L_2$ loss in the presence of adversarial distribution shifts.
A new algorithm is developed to approximate the squared loss with respect to the worst distribution that is within $\chi^2$-divergence of $\mathcal{p}_0$.
arXiv Detail & Related papers (2024-11-11T03:43:52Z) - In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting [21.002519159190538]
We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients.
We obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$.
arXiv Detail & Related papers (2024-09-13T12:28:42Z) - Complexity of Minimizing Projected-Gradient-Dominated Functions with Stochastic First-order Oracles [38.45952947660789]
This work investigates the performance limits of projected first-order methods for minimizing functions under the $(\alpha, \tau, \mathcal{X})$-projected-dominance property.
arXiv Detail & Related papers (2024-08-03T18:34:23Z) - Efficient Continual Finite-Sum Minimization [52.5238287567572]
We propose a key twist into the finite-sum minimization, dubbed as continual finite-sum minimization.
Our approach significantly improves upon the $\mathcal{O}(n/\epsilon)$ first-order oracle (FO) calls that $\mathrm{StochasticGradientDescent}$ requires.
We also prove that there is no natural first-order method with $\mathcal{O}\left(n/\epsilon^{\alpha}\right)$ gradient complexity for $\alpha < 1/4$, establishing that the first-order complexity of our method is nearly tight.
arXiv Detail & Related papers (2024-06-07T08:26:31Z) - Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix
Factorization [54.29685789885059]
We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem.
The goal is to approximate $\mathbf{A}$ as a product of low-rank factors.
Our techniques generalize to other common variants of the BMF problem.
arXiv Detail & Related papers (2023-06-02T18:55:27Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - On Outer Bi-Lipschitz Extensions of Linear Johnson-Lindenstrauss
Embeddings of Low-Dimensional Submanifolds of $\mathbb{R}^N$ [0.24366811507669117]
Let $\mathcal{M}$ be a compact $d$-dimensional submanifold of $\mathbb{R}^N$ with reach $\tau$ and volume $V_{\mathcal{M}}$.
We prove that a nonlinear function $f: \mathbb{R}^N \rightarrow \mathbb{R}^m$ exists with $m \leq C\left(d/\epsilon^2\right) \log\left(\frac{\sqrt[d]{V_{\mathcal{M}}}}{\cdots}\right)$
arXiv Detail & Related papers (2022-06-07T15:10:46Z) - On the Self-Penalization Phenomenon in Feature Selection [69.16452769334367]
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels.
As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
arXiv Detail & Related papers (2021-10-12T09:36:41Z) - Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n$, $d$, and $\delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.