Preconditioned Gradient Descent for Overparameterized Nonconvex
Burer--Monteiro Factorization with Global Optimality Certification
- URL: http://arxiv.org/abs/2206.03345v2
- Date: Thu, 20 Apr 2023 22:38:06 GMT
- Title: Preconditioned Gradient Descent for Overparameterized Nonconvex
Burer--Monteiro Factorization with Global Optimality Certification
- Authors: Gavin Zhang, Salar Fattahi, Richard Y. Zhang
- Abstract summary: We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$.
We propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear.
- Score: 14.674494335647841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider using gradient descent to minimize the nonconvex function
$f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is
an underlying smooth convex cost function defined over $n\times n$ matrices.
While only a second-order stationary point $X$ can be provably found in
reasonable time, if $X$ is additionally rank deficient, then its rank
deficiency certifies it as being globally optimal. This way of certifying
global optimality necessarily requires the search rank $r$ of the current
iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the
global minimizer $X^{\star}$. Unfortunately, overparameterization significantly
slows down the convergence of gradient descent, from a linear rate with
$r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is
strongly convex. In this paper, we propose an inexpensive preconditioner that
restores the convergence rate of gradient descent back to linear in the
overparameterized case, while also making it agnostic to possible
ill-conditioning in the global minimizer $X^{\star}$.
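As a concrete illustration of the preconditioned update described above, the following is a minimal NumPy sketch of gradient descent on $f(X)=\phi(XX^{T})$ with a right-preconditioner of the form $(X^{T}X+\eta_t I)^{-1}$. The damping rule $\eta_t\propto\|\nabla f(X_t)\|^{1/2}$ and the toy cost $\phi(M)=\tfrac{1}{2}\|M-M^{\star}\|_F^{2}$ are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def prec_gd(grad_phi, X0, step=0.1, iters=500, tol=1e-10):
    """Sketch of preconditioned gradient descent for f(X) = phi(X X^T).

    grad_phi(M) returns the gradient of phi at the n x n matrix M.
    The right-preconditioner (X^T X + eta * I)^{-1} matches the update
    described in the abstract; the damping eta ~ ||grad f(X)||^{1/2}
    is an assumed schedule for illustration only.
    """
    X = X0.copy()
    for _ in range(iters):
        G = 2.0 * grad_phi(X @ X.T) @ X               # gradient of f(X) = phi(X X^T)
        if np.linalg.norm(G) < tol:
            break
        eta = np.sqrt(np.linalg.norm(G))              # assumed damping parameter
        P = np.linalg.inv(X.T @ X + eta * np.eye(X.shape[1]))
        X = X - step * G @ P                          # preconditioned step
    return X

# Toy example: phi(M) = 0.5*||M - M_star||_F^2 with an overparameterized factor.
rng = np.random.default_rng(0)
n, r_star, r = 20, 2, 5                               # search rank r > r_star
Z = rng.standard_normal((n, r_star))
M_star = Z @ Z.T
X_hat = prec_gd(lambda M: M - M_star, 0.1 * rng.standard_normal((n, r)))
print(np.linalg.norm(X_hat @ X_hat.T - M_star))       # should be near zero
```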
Related papers
- A Proximal Modified Quasi-Newton Method for Nonsmooth Regularized Optimization [0.7373617024876725]
arXiv Detail & Related papers (2024-09-28T18:16:32Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p}n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Convergence of Gradient Descent with Small Initialization for
Unregularized Matrix Completion [21.846732043706318]
We show that vanilla gradient descent provably converges to the ground truth $X^{\star}$ without requiring any explicit regularization.
Surprisingly, neither the convergence rate nor the final accuracy depends on the overparameterized search rank $r'$; they are governed only by the true rank $r$.
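A minimal sketch of the setting above, assuming a symmetric matrix-completion loss $\tfrac{1}{2}\|P_{\Omega}(XX^{T}-M^{\star})\|_F^{2}$, a small random initialization, and an overparameterized search rank; the observation rate, step size, and iteration count are illustrative choices.

```python
import numpy as np

# Vanilla gradient descent on the (assumed) symmetric matrix-completion loss
# 0.5*||P_Omega(X X^T - M_star)||_F^2, with small init and search rank > true rank.
rng = np.random.default_rng(1)
n, r_true, r_search = 30, 2, 6
Z = rng.standard_normal((n, r_true))
M_star = Z @ Z.T
mask = rng.random((n, n)) < 0.5                 # observed entries
mask = mask | mask.T                            # keep the pattern symmetric

X = 1e-3 * rng.standard_normal((n, r_search))   # small random initialization
step = 2e-3
for _ in range(5000):
    R = mask * (X @ X.T - M_star)               # residual on observed entries
    X -= step * 2.0 * R @ X                     # vanilla gradient step
print(np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star))
```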
arXiv Detail & Related papers (2024-02-09T19:39:23Z) - Estimating the minimizer and the minimum value of a regression function
under passive design [72.85024381807466]
We propose a new method for estimating the minimizer $\boldsymbol{x}^{*}$ and the minimum value $f^{*}$ of a smooth and strongly convex regression function $f$.
We derive non-asymptotic upper bounds for the quadratic risk and optimization error of $\boldsymbol{z}_{n}$, and for the risk of estimating $f^{*}$.
arXiv Detail & Related papers (2022-11-29T18:38:40Z) - Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient
Oracle Complexity [15.18055488087588]
We give an algorithm that minimizes the above convex formulation to $\epsilon$-accuracy in $\widetilde{O}(\sum_{i=1}^{n} d_i \log(1/\epsilon))$ gradient computations.
Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.
arXiv Detail & Related papers (2022-08-07T20:53:42Z) - Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization [10.787390511207683]
We consider a twice-differentiable, $L$-smooth, $\mu$-strongly convex objective $\phi$ over an $n\times n$ matrix, with the search rank overparameterized as $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$.
Despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum.
arXiv Detail & Related papers (2022-07-05T03:18:17Z) - A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $\min_{x}\max_{y} f(x) + \langle Ax, y\rangle - g^{*}(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual steps.
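For reference, here is a minimal sketch of the basic (non-adaptive) Condat-Vu iteration for this saddle-point formulation; the paper's adaptive step-size rule is not reproduced, and the example problem, step sizes, and function names are illustrative assumptions.

```python
import numpy as np

def condat_vu(grad_f, prox_g_conj, A, x0, y0, tau, sigma, iters=2000):
    """Basic Condat-Vu iteration for min_x f(x) + g(Ax), written through the
    saddle point min_x max_y f(x) + <Ax, y> - g*(y). Fixed step sizes only;
    the adaptive rule from the paper above is not implemented here."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_new = x - tau * (grad_f(x) + A.T @ y)                   # primal gradient step
        y = prox_g_conj(y + sigma * A @ (2 * x_new - x), sigma)   # dual proximal step
        x = x_new
    return x, y

# Example: min_x 0.5*||x - b||^2 + ||A x||_1, so g*(y) is the indicator of
# ||y||_inf <= 1 and prox_{sigma g*} is a clip onto [-1, 1].
rng = np.random.default_rng(2)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(10)
L_f, op_norm = 1.0, np.linalg.norm(A, 2)
tau = 0.9 / (L_f / 2 + op_norm)           # satisfies 1/tau - sigma*||A||^2 >= L_f/2
sigma = 1.0 / op_norm
x_hat, _ = condat_vu(lambda x: x - b, lambda v, s: np.clip(v, -1.0, 1.0),
                     A, np.zeros(10), np.zeros(40), tau, sigma)
```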
arXiv Detail & Related papers (2021-10-28T14:19:30Z) - Minimax Optimal Regression over Sobolev Spaces via Laplacian
Regularization on Neighborhood Graphs [25.597646488273558]
We study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression.
We prove that Laplacian smoothing is manifold-adaptive.
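A minimal sketch of Laplacian smoothing on a k-nearest-neighbor graph, using the standard penalized least-squares form $\hat{f}=(I+\lambda L)^{-1}y$; the graph construction, $k$, and $\lambda$ below are illustrative choices rather than the paper's exact setup.

```python
import numpy as np

def laplacian_smoother(X, y, k=5, lam=1.0):
    """Laplacian smoothing: argmin_f ||y - f||^2 + lam * f^T L f = (I + lam*L)^{-1} y,
    where L is the unnormalized Laplacian of a k-nearest-neighbor graph on X."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(D2[i])[1:k + 1]] = 1.0             # connect k nearest neighbors
    W = np.maximum(W, W.T)                                  # symmetrize the graph
    L = np.diag(W.sum(1)) - W                               # unnormalized graph Laplacian
    return np.linalg.solve(np.eye(n) + lam * L, y)

# Toy nonparametric regression example.
rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.3 * rng.standard_normal(200)
f_hat = laplacian_smoother(X, y)
```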
arXiv Detail & Related papers (2021-06-03T01:20:41Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
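A minimal sketch of gradient descent on the empirical square loss of a single ReLU neuron $x\mapsto\mathrm{relu}(\langle w,x\rangle)$; the Gaussian teacher model, noise level, moment-based initialization, and step size are illustrative assumptions.

```python
import numpy as np

# Gradient descent on the empirical square loss of a single ReLU neuron.
# The data model and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(4)
d, n = 10, 5000
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.maximum(X @ w_star, 0.0) + 0.1 * rng.standard_normal(n)  # noisy labels

w = (y[:, None] * X).mean(0)             # moment-based warm start (about w_star / 2)
step = 0.5
for _ in range(500):
    pred = np.maximum(X @ w, 0.0)                       # ReLU output
    active = (X @ w > 0).astype(float)[:, None]         # ReLU (sub)gradient mask
    grad = ((pred - y)[:, None] * active * X).mean(0)   # grad of 0.5 * mean squared loss
    w -= step * grad
print(np.linalg.norm(w - w_star))                       # should be small
```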
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.