The Power of Preconditioning in Overparameterized Low-Rank Matrix
Sensing
- URL: http://arxiv.org/abs/2302.01186v3
- Date: Mon, 6 Nov 2023 13:47:12 GMT
- Title: The Power of Preconditioning in Overparameterized Low-Rank Matrix
Sensing
- Authors: Xingyu Xu, Yandi Shen, Yuejie Chi, Cong Ma
- Abstract summary: $\textsf{ScaledGD($\lambda$)}$ is a preconditioned gradient descent method to tackle the low-rank matrix sensing problem.
We show that $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations.
- Score: 42.905196856926615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent
method to tackle the low-rank matrix sensing problem when the true rank is
unknown, and when the matrix is possibly ill-conditioned. Using
overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts
from a small random initialization, and proceeds by gradient descent with a
specific form of damped preconditioning to combat bad curvatures induced by
overparameterization and ill-conditioning. At the expense of light
computational overhead incurred by preconditioners,
$\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning
compared to vanilla gradient descent ($\textsf{GD}$) even with
overparameterization. Specifically, we show that, under the Gaussian design,
$\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a
constant linear rate after a small number of iterations that scales only
logarithmically with respect to the condition number and the problem dimension.
This significantly improves over the convergence rate of vanilla $\textsf{GD}$
which suffers from a polynomial dependency on the condition number. Our work
provides evidence on the power of preconditioning in accelerating the
convergence without hurting generalization in overparameterized learning.
Related papers
- Efficient Over-parameterized Matrix Sensing from Noisy Measurements via Alternating Preconditioned Gradient Descent [17.73720530889677]
Preconditioning methods have been proposed to accelerate the convergence of the matrix sensing problem.
We propose an alternating preconditioned gradient descent (APGD) algorithm, which alternately updates the two factor matrices.
We theoretically prove that APGD achieves near-optimal convergence at a linear rate, starting from an arbitrary random initialization.
arXiv Detail & Related papers (2025-02-01T15:44:39Z) - SAPPHIRE: Preconditioned Stochastic Variance Reduction for Faster Large-Scale Statistical Learning [18.055120576191204]
Ill-conditioned objectives and nonsmooth regularizers undermine the performance of traditional convex methods.
We propose a preconditioned, variance-reduced solution for ill-conditioned composite large-scale machine learning problems.
arXiv Detail & Related papers (2025-01-27T10:36:45Z) - Convergence Rate Analysis of LION [54.28350823319057]
LION converges with an iteration complexity of $\mathcal{O}(\sqrt{d}\,K^{-1/4})$, measured by a gradient-based Karush-Kuhn-Tucker (KKT) stationarity criterion.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}}\; F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient steps with Nesterov acceleration.
arXiv Detail & Related papers (2022-06-17T06:10:20Z) - Preconditioned Gradient Descent for Overparameterized Nonconvex
Burer--Monteiro Factorization with Global Optimality Certification [14.674494335647841]
We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$.
We propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear.
arXiv Detail & Related papers (2022-06-07T14:26:49Z) - A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $\min_x \max_y f(x) + \langle Ax, y\rangle - g^*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual steps.
arXiv Detail & Related papers (2021-10-28T14:19:30Z) - WARPd: A linearly convergent first-order method for inverse problems
with approximate sharpness conditions [0.0]
Sharpness conditions directly control the recovery performance of restart schemes for first-order methods.
We provide a first-order method: Weighted, Accelerated and Restarted Primal-dual (WARPd)
Under a generic approximate sharpness condition, WARPd achieves stable linear convergence to the desired vector.
We show how WARPd compares favorably with specialized state-of-the-art methods and is ideally suited for solving large-scale problems.
arXiv Detail & Related papers (2021-10-24T13:19:41Z) - Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled
Gradient Descent [34.0533596121548]
Low-rank matrix estimation is a canonical nonconvex problem that finds numerous applications in signal processing, machine learning and imaging science.
We show that ScaledGD achieves linear convergence at a rate independent of the condition number of the low-rank matrix, at a per-iteration cost comparable to vanilla gradient descent.
Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices.
arXiv Detail & Related papers (2020-05-18T17:17:16Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)