Implicit Balancing and Regularization: Generalization and Convergence
Guarantees for Overparameterized Asymmetric Matrix Sensing
- URL: http://arxiv.org/abs/2303.14244v2
- Date: Fri, 30 Jun 2023 22:02:12 GMT
- Title: Implicit Balancing and Regularization: Generalization and Convergence
Guarantees for Overparameterized Asymmetric Matrix Sensing
- Authors: Mahdi Soltanolkotabi, Dominik Stöger, Changzhi Xie
- Abstract summary: A series of recent papers have begun to study this role for non-convex formulations of symmetric Positive Semi-Definite (PSD) matrix sensing problems.
In this paper, we show that the gradient descent trajectory from small random initialization moves towards solutions that are both globally optimal and generalize well.
- Score: 28.77440901439686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been significant progress in understanding the
convergence and generalization properties of gradient-based methods for
training overparameterized learning models. However, many aspects including the
role of small random initialization and how the various parameters of the model
are coupled during gradient-based updates to facilitate good generalization
remain largely mysterious. A series of recent papers have begun to study this
role for non-convex formulations of symmetric Positive Semi-Definite (PSD)
matrix sensing problems which involve reconstructing a low-rank PSD matrix from
a few linear measurements. The underlying symmetry/PSDness is crucial to
existing convergence and generalization guarantees for this problem. In this
paper, we study a general overparameterized low-rank matrix sensing problem
where one wishes to reconstruct an asymmetric rectangular low-rank matrix from
a few linear measurements. We prove that an overparameterized model trained via
factorized gradient descent converges to the low-rank matrix generating the
measurements. We show that in this setting, factorized gradient descent enjoys
two implicit properties: (1) coupling of the trajectory of gradient descent
where the factors are coupled in various ways throughout the gradient update
trajectory and (2) an algorithmic regularization property where the iterates
show a propensity towards low-rank models despite the overparameterized nature
of the factorized model. These two implicit properties in turn allow us to show
that the gradient descent trajectory from small random initialization moves
towards solutions that are both globally optimal and generalize well.
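As a concrete illustration of the setup described in the abstract, below is a minimal NumPy sketch (not code released with the paper) of factorized gradient descent for overparameterized asymmetric matrix sensing with small random initialization. The dimensions, the Gaussian measurement ensemble, the step size, and the initialization scale alpha are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (not the paper's code) of factorized gradient descent for
# overparameterized asymmetric matrix sensing with small random initialization.
# Dimensions, measurement ensemble, step size, and alpha are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, k, m = 40, 30, 2, 4, 800   # true rank r, overparameterized width k > r

# Ground-truth rectangular low-rank matrix M* = L* R*^T.
L_star = rng.normal(size=(d1, r)) / np.sqrt(d1)
R_star = rng.normal(size=(d2, r)) / np.sqrt(d2)
M_star = L_star @ R_star.T

# Linear measurements y_i = <A_i, M*> with Gaussian sensing matrices A_i.
A = rng.normal(size=(m, d1, d2)) / np.sqrt(m)
y = np.einsum("mij,ij->m", A, M_star)

# Small random initialization of the overparameterized factors U, V.
alpha = 1e-3
U = alpha * rng.normal(size=(d1, k))
V = alpha * rng.normal(size=(d2, k))

eta, iters = 0.2, 2000                # constant step size (assumed, not tuned)
for _ in range(iters):
    residual = np.einsum("mij,ij->m", A, U @ V.T) - y   # A(U V^T) - y
    G = np.einsum("m,mij->ij", residual, A)             # adjoint A*(residual)
    U, V = U - eta * G @ V, V - eta * G.T @ U           # factorized GD step

M_hat = U @ V.T
print("relative recovery error:",
      np.linalg.norm(M_hat - M_star) / np.linalg.norm(M_star))
print("factor imbalance ||U^T U - V^T V||_F:", np.linalg.norm(U.T @ U - V.T @ V))
print("singular values of the iterate:",
      np.round(np.linalg.svd(M_hat, compute_uv=False)[:k], 6))
```

The final printouts are meant to mirror the two implicit properties claimed in the abstract: the imbalance U^T U - V^T V of the two factors remains small, and only about r of the k available directions grow, so the iterate stays effectively low rank despite the overparameterization.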
Related papers
- On Learning Gaussian Multi-index Models with Gradient Flow [57.170617397894404]
We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data.
We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection.
arXiv Detail & Related papers (2023-10-30T17:55:28Z) - Provably Accelerating Ill-Conditioned Low-rank Estimation via Scaled
Gradient Descent, Even with Overparameterization [48.65416821017865]
This chapter introduces a new algorithmic approach, dubbed scaled gradient descent (ScaledGD).
It converges linearly at a constant rate independent of the condition number of the low-rank object.
It maintains the low per-iteration cost of gradient descent for a variety of tasks (a sketch of this preconditioned update appears after the related-papers list below).
arXiv Detail & Related papers (2023-10-09T21:16:57Z) - Asymmetric matrix sensing by gradient descent with small random
initialization [0.8611782340880084]
We study the problem of reconstructing a low-rank matrix from a few linear measurements.
Our key contribution is introducing a continuous gradient flow equation that we call the perturbed gradient flow.
arXiv Detail & Related papers (2023-09-04T20:23:35Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of minimum-trace-of-the-Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Understanding Incremental Learning of Gradient Descent: A Fine-grained
Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models.
This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
arXiv Detail & Related papers (2023-01-27T02:30:51Z) - Demystifying the Global Convergence Puzzle of Learning
Over-parameterized ReLU Nets in Very High Dimensions [1.3401746329218014]
This paper is devoted to rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized ReLU nets with very high-dimensional data.
arXiv Detail & Related papers (2022-06-05T02:14:21Z) - Small random initialization is akin to spectral learning: Optimization
and generalization guarantees for overparameterized low-rank matrix
reconstruction [35.585697639325105]
In this paper we note that the role of small random initialization is not fully understood.
We show that gradient descent from a small random initialization behaves akin to spectral learning, yielding optimization and generalization guarantees for overparameterized low-rank matrix reconstruction.
arXiv Detail & Related papers (2021-06-28T22:52:39Z) - Beyond Procrustes: Balancing-Free Gradient Descent for Asymmetric
Low-Rank Matrix Sensing [36.96922859748537]
Low-rank matrix estimation plays a central role in various applications across science and engineering.
Existing approaches rely on adding a regularization term to balance the scale of the two matrix factors.
In this paper, we provide a theoretical justification that such balancing regularization is unnecessary for recovering a low-rank matrix from a small number of linear measurements.
arXiv Detail & Related papers (2021-01-13T15:03:52Z) - Understanding Implicit Regularization in Over-Parameterized Single Index
Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - Semiparametric Nonlinear Bipartite Graph Representation Learning with
Provable Guarantees [106.91654068632882]
We consider the bipartite graph and formalize its representation learning problem as a statistical estimation problem of parameters in a semiparametric exponential family distribution.
We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves linear convergence rate.
Our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.
arXiv Detail & Related papers (2020-03-02T16:40:36Z)
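For context on the ScaledGD entry above (see the forward reference in that item), here is a sketch of the preconditioned update rule commonly associated with scaled gradient descent on a factorized objective f(L, R); the factors L, R and the step size \eta are our notation and are not quoted from that paper's abstract.

```latex
% Sketch of a ScaledGD-style preconditioned update for a factorized
% objective f(L, R); notation is illustrative, not quoted from the paper.
\begin{aligned}
L_{t+1} &= L_t - \eta\,\nabla_L f(L_t, R_t)\,\bigl(R_t^\top R_t\bigr)^{-1},\\
R_{t+1} &= R_t - \eta\,\nabla_R f(L_t, R_t)\,\bigl(L_t^\top L_t\bigr)^{-1}.
\end{aligned}
```

Rescaling each factor's gradient by the Gram matrix of the other factor is what makes the linear convergence rate insensitive to the condition number of the low-rank object, while keeping the per-iteration cost comparable to plain gradient descent.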