Convergence of Gradient Descent with Small Initialization for
Unregularized Matrix Completion
- URL: http://arxiv.org/abs/2402.06756v1
- Date: Fri, 9 Feb 2024 19:39:23 GMT
- Title: Convergence of Gradient Descent with Small Initialization for
Unregularized Matrix Completion
- Authors: Jianhao Ma, Salar Fattahi
- Abstract summary: We show that the vanilla gradient descent provably converges to the ground truth $\rm{X}^\star$ without requiring any explicit regularization.
Surprisingly, neither the convergence rate nor the final accuracy depends on the over-parameterized search rank $r'$, and they are only governed by the true rank $r$.
- Score: 21.846732043706318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of symmetric matrix completion, where the goal is to
reconstruct a positive semidefinite matrix $\rm{X}^\star \in
\mathbb{R}^{d\times d}$ of rank-$r$, parameterized by $\rm{U}\rm{U}^{\top}$,
from only a subset of its observed entries. We show that the vanilla gradient
descent (GD) with small initialization provably converges to the ground truth
$\rm{X}^\star$ without requiring any explicit regularization. This convergence
result holds true even in the over-parameterized scenario, where the true rank
$r$ is unknown and conservatively over-estimated by a search rank $r'\gg r$.
The existing results for this problem either require explicit regularization, a
sufficiently accurate initial point, or exact knowledge of the true rank $r$.
In the over-parameterized regime where $r'\geq r$, we show that, with
$\widetilde\Omega(dr^9)$ observations, GD with an initial point $\|\rm{U}_0\|
\leq \epsilon$ converges near-linearly to an $\epsilon$-neighborhood of
$\rm{X}^\star$. Consequently, smaller initial points result in increasingly
accurate solutions. Surprisingly, neither the convergence rate nor the final
accuracy depends on the over-parameterized search rank $r'$, and they are only
governed by the true rank $r$. In the exactly-parameterized regime where
$r'=r$, we further enhance this result by proving that GD converges at a faster
rate to achieve an arbitrarily small accuracy $\epsilon>0$, provided the
initial point satisfies $\|\rm{U}_0\| = O(1/d)$. At the crux of our method lies
a novel weakly-coupled leave-one-out analysis, which allows us to establish the
global convergence of GD, extending beyond what was previously possible using
the classical leave-one-out analysis.
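To make the setting concrete, below is a minimal sketch (not the authors' code) of the procedure the abstract analyzes: vanilla gradient descent on the unregularized factorized loss $\frac{1}{2}\|P_{\Omega}(\rm{U}\rm{U}^{\top}-\rm{X}^\star)\|_F^2$ with an over-parameterized search rank $r'\gg r$ and a small initialization $\|\rm{U}_0\|\leq\epsilon$. The problem sizes, observation probability, step size, and iteration count are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of vanilla GD with small initialization for unregularized
# symmetric matrix completion (illustrative parameters, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
d, r, r_prime = 50, 3, 10        # dimension, true rank r, search rank r' >> r
p = 0.5                          # fraction of observed entries (assumption)

# Rank-r PSD ground truth X* = U* U*^T and a symmetric observation mask Omega.
U_star = rng.standard_normal((d, r))
X_star = U_star @ U_star.T
upper = np.triu(rng.random((d, d)) < p)
mask = (upper | upper.T).astype(float)

def gradient(U):
    """Gradient of the unregularized loss f(U) = 0.5 * ||P_Omega(U U^T - X*)||_F^2."""
    residual = mask * (U @ U.T - X_star)
    return 2.0 * residual @ U

# Small initialization: scale U_0 so that its spectral norm equals eps.
eps = 1e-3
U = rng.standard_normal((d, r_prime))
U *= eps / np.linalg.norm(U, 2)

eta = 0.2 / np.linalg.norm(X_star, 2)   # illustrative step size
for _ in range(3000):
    U -= eta * gradient(U)

rel_err = np.linalg.norm(U @ U.T - X_star, "fro") / np.linalg.norm(X_star, "fro")
print(f"relative error ||U U^T - X*||_F / ||X*||_F = {rel_err:.2e}")
```

In line with the abstract, shrinking `eps` should yield a correspondingly more accurate estimate of $\rm{X}^\star$ in the over-parameterized run, while setting `r_prime = r` corresponds to the exactly-parameterized regime with its faster rate.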
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing:
The Curses of Symmetry and Initialization [46.55524654398093]
We show how over-parameterization changes the convergence behavior of gradient descent.
The goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements.
We propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$.
arXiv Detail & Related papers (2023-10-03T03:34:22Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Convergence of a Normal Map-based Prox-SGD Method under the KL
Inequality [0.0]
We present a novel normal map-based algorithm ($\mathsf{norM}\text{-}\mathsf{SGD}$) for nonconvex optimization problems and analyze its convergence under the KL inequality.
arXiv Detail & Related papers (2023-05-10T01:12:11Z) - Optimal Query Complexities for Dynamic Trace Estimation [59.032228008383484]
We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly.
We provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $\epsilon$ error with $\delta$ failure probability.
Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error even in the static setting, and (2) are the first unconditional lower bounds for dynamic trace estimation.
arXiv Detail & Related papers (2022-09-30T04:15:44Z) - Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization [10.787390511207683]
We consider a twice-differentiable, $L$-smooth, $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix, with a Burer--Monteiro search rank exceeding $\frac{1}{4}(L/\mu-1)^{2} r^{\star}$.
Despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum.
arXiv Detail & Related papers (2022-07-05T03:18:17Z) - Preconditioned Gradient Descent for Overparameterized Nonconvex
Burer--Monteiro Factorization with Global Optimality Certification [14.674494335647841]
We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$.
We propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear.
arXiv Detail & Related papers (2022-06-07T14:26:49Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - Accelerating Optimization and Reinforcement Learning with
Quasi-Stochastic Approximation [2.294014185517203]
This paper sets out to extend convergence theory to quasi-stochastic approximations.
It is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-09-30T04:44:45Z)