Sharp Global Guarantees for Nonconvex Low-rank Recovery in the Noisy Overparameterized Regime
- URL: http://arxiv.org/abs/2104.10790v3
- Date: Tue, 06 May 2025 17:29:39 GMT
- Title: Sharp Global Guarantees for Nonconvex Low-rank Recovery in the Noisy Overparameterized Regime
- Authors: Richard Y. Zhang
- Abstract summary: We introduce a novel proof technique that unifies, simplifies, and strengthens two previously competing approaches. We show that near-second-order points achieve the same minimax recovery bounds as significantly more expensive convex approaches.
- Score: 10.787390511207683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work established that rank overparameterization eliminates spurious local minima in nonconvex low-rank matrix recovery under the restricted isometry property (RIP). But this does not fully explain the practical success of overparameterization, because real algorithms can still become trapped at nonstrict saddle points (approximate second-order points with arbitrarily small negative curvature) even when all local minima are global. Moreover, the result does not accommodate noisy measurements, and it is unclear whether such an extension is even possible, in view of the many discontinuous and unintuitive behaviors already known for the overparameterized regime. In this paper, we introduce a novel proof technique that unifies, simplifies, and strengthens two previously competing approaches -- one based on escape directions and the other based on the nonexistence of counterexamples -- to provide sharp global guarantees in the noisy overparameterized regime. We show, once local minima have been converted into global minima through slight overparameterization, that near-second-order points achieve the same minimax-optimal recovery bounds (up to small constant factors) as significantly more expensive convex approaches. Our results are sharp with respect to the noise level and the solution accuracy, and hold for both the symmetric parameterization $XX^{T}$ and the asymmetric parameterization $UV^{T}$ under a balancing regularizer; we demonstrate that the balancing regularizer is indeed necessary.
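For readers who want a concrete handle on the two parameterizations named in the abstract, below is a minimal NumPy sketch (not the authors' code) of the factored least-squares objectives: the symmetric form $f(X)=\|\mathcal{A}(XX^{T})-y\|^{2}$ and the asymmetric form with a balancing term $\frac{\lambda}{4}\|U^{T}U-V^{T}V\|_{F}^{2}$, a standard choice in the low-rank recovery literature. The random measurement operator, noise level, and weight $\lambda$ below are illustrative assumptions, and the exact regularizer used in the paper may differ.

```python
# Minimal sketch of the symmetric XX^T and asymmetric UV^T factored objectives.
# The measurement operator A, noise level, and lam are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r_search, m = 20, 2, 4, 400   # r_search > r_true: overparameterized

# Ground-truth low-rank matrix and noisy linear measurements y = A(M*) + noise
Z = rng.standard_normal((n, r_true))
M_star = Z @ Z.T
A = rng.standard_normal((m, n * n)) / np.sqrt(m)   # random (RIP-like) operator
y = A @ M_star.ravel() + 0.01 * rng.standard_normal(m)

def loss_symmetric(X):
    """Symmetric parameterization: f(X) = ||A(XX^T) - y||^2."""
    return np.sum((A @ (X @ X.T).ravel() - y) ** 2)

def loss_asymmetric(U, V, lam=0.25):
    """Asymmetric parameterization with a balancing regularizer
    (lam/4) * ||U^T U - V^T V||_F^2 (one common form; weight lam is assumed)."""
    fit = np.sum((A @ (U @ V.T).ravel() - y) ** 2)
    balance = 0.25 * lam * np.linalg.norm(U.T @ U - V.T @ V, "fro") ** 2
    return fit + balance

# Small random initialization with the overparameterized search rank
X0 = 1e-3 * rng.standard_normal((n, r_search))
print(loss_symmetric(X0), loss_asymmetric(X0, X0.copy()))
```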
Related papers
- Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization [50.49466204159458]
We propose two novel estimators based on the idea of noise symmetrization. Compared to works assuming symmetric noise with finite moments, we provide a sharper analysis and improved rates.
arXiv Detail & Related papers (2025-07-12T00:31:13Z) - Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model [13.582475656749775]
We study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model.
We establish nearly matching upper and lower bounds on the sample complexity of this problem.
arXiv Detail & Related papers (2025-03-11T22:31:03Z) - Sample Complexity of Linear Quadratic Regulator Without Initial Stability [11.98212766542468]
Inspired by REINFORCE, we introduce a novel receding-horizon algorithm for the Linear Quadratic Regulator (LQR) problem with unknown parameters. Unlike prior methods, our algorithm avoids reliance on two-point gradient estimates while maintaining the same order of sample complexity.
arXiv Detail & Related papers (2025-02-20T02:44:25Z) - From Gradient Clipping to Normalization for Heavy Tailed SGD [19.369399536643773]
Recent empirical evidence indicates that machine learning applications involve heavy-tailed noise, which challenges the standard assumption of bounded variance in practice. In this paper, we show that it is possible to achieve tight convergence guarantees for gradient-based methods under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-17T17:59:01Z) - Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization [8.879403568685499]
We introduce an adaptive updating strategy for smoothing parameters.
This behavior transforms the algorithm into one that effectively solves problems after a few iterations.
We prove global convergence of the proposed algorithm, guaranteeing that every accumulation point is a critical point.
arXiv Detail & Related papers (2024-06-22T02:37:13Z) - A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions.
We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters.
arXiv Detail & Related papers (2024-06-06T01:52:09Z) - Low-rank Matrix Bandits with Heavy-tailed Rewards [55.03293214439741]
We study the problem of low-rank matrix bandits with heavy-tailed rewards (LowHTR).
By utilizing the truncation on observed payoffs and the dynamic exploration, we propose a novel algorithm called LOTUS.
arXiv Detail & Related papers (2024-04-26T21:54:31Z) - Convergence of Gradient Descent with Small Initialization for
Unregularized Matrix Completion [21.846732043706318]
We show that vanilla gradient descent provably converges to the ground truth $X^{\star}$ without requiring any explicit regularization.
Surprisingly, neither the convergence rate nor the final accuracy depends on the overparameterized search rank $r'$, and they are only governed by the true rank $r$.
arXiv Detail & Related papers (2024-02-09T19:39:23Z) - Riemannian stochastic optimization methods avoid strict saddle points [68.80251170757647]
We show that policies under study avoid strict saddle points / submanifolds with probability 1.
This result provides an important sanity check as it shows that, almost always, the limit state of an algorithm can only be a local minimizer.
arXiv Detail & Related papers (2023-11-04T11:12:24Z) - Distributionally Robust Optimization with Bias and Variance Reduction [9.341215359733601]
We show that Prospect, a gradient-based algorithm, enjoys linear convergence for smooth regularized losses.
We also show that Prospect can converge 2-3$\times$ faster than baselines such as gradient-based methods.
arXiv Detail & Related papers (2023-10-21T00:03:54Z) - Implicit bias of SGD in $L_{2}$-regularized linear DNNs: One-way jumps
from high to low rank [22.850359181621023]
In tasks such as matrix completion, the goal is to converge to the local minimum with the smallest rank that still fits the training data.
We show that with SGD, there is always a probability to jump from a higher rank minimum to a lower rank one, but the probability of jumping back is zero.
arXiv Detail & Related papers (2023-05-25T13:17:32Z) - On the Optimality of Misspecified Kernel Ridge Regression [13.995944403996566]
We show that KRR is minimax optimal for any $s\in(0,1)$ when $\mathcal{H}$ is a Sobolev RKHS.
arXiv Detail & Related papers (2023-05-12T04:12:12Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption, and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization [10.787390511207683]
We consider a twice-differentiable, $L$-smooth, $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix, factored as $XX^{T}$ with search rank $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$.
Despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum.
arXiv Detail & Related papers (2022-07-05T03:18:17Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - Global Convergence of Sub-gradient Method for Robust Matrix Recovery:
Small Initialization, Noisy Measurements, and Over-parameterization [4.7464518249313805]
Sub-gradient method (SubGM) is used to recover a low-rank matrix from a limited number of measurements.
We show that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values.
arXiv Detail & Related papers (2022-02-17T17:50:04Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree-$p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization [74.1615979057429]
We investigate stochastic nonconvex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
arXiv Detail & Related papers (2021-11-01T15:43:36Z) - Optimal Spectral Recovery of a Planted Vector in a Subspace [80.02218763267992]
We study efficient estimation and detection of a planted vector $v$ whose $\ell_4$ norm differs from that of a Gaussian vector with the same $\ell_2$ norm.
We show that in the regime $n\rho \gg \sqrt{N}$, any spectral method from a large class (and more generally, any low-degree polynomial of the input) fails to detect the planted vector.
arXiv Detail & Related papers (2021-05-31T16:10:49Z) - The Average-Case Time Complexity of Certifying the Restricted Isometry
Property [66.65353643599899]
In compressed sensing, the restricted isometry property (RIP) on $M\times N$ sensing matrices guarantees efficient reconstruction of sparse vectors.
We investigate the exact average-case time complexity of certifying the RIP for $M\times N$ matrices with i.i.d. $\mathcal{N}(0,1/M)$ entries.
arXiv Detail & Related papers (2020-05-22T16:55:01Z) - Support recovery and sup-norm convergence rates for sparse pivotal
estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level.
We show minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multi-task square-root Lasso-type estimators.
arXiv Detail & Related papers (2020-01-15T16:11:04Z)