Towards Resolving the Implicit Bias of Gradient Descent for Matrix
Factorization: Greedy Low-Rank Learning
- URL: http://arxiv.org/abs/2012.09839v2
- Date: Sun, 11 Apr 2021 12:40:57 GMT
- Title: Towards Resolving the Implicit Bias of Gradient Descent for Matrix
Factorization: Greedy Low-Rank Learning
- Authors: Zhiyuan Li, Yuping Luo, Kaifeng Lyu
- Abstract summary: Matrix factorization is a simple and natural test-bed to investigate the implicit regularization of gradient descent.
We show that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple greedy rank minimization algorithm.
- Score: 19.82453283089643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Matrix factorization is a simple and natural test-bed to investigate the
implicit regularization of gradient descent. Gunasekar et al. (2017)
conjectured that Gradient Flow with infinitesimal initialization converges to
the solution that minimizes the nuclear norm, but a series of recent papers
argued that the language of norm minimization is not sufficient to give a full
characterization for the implicit regularization. In this work, we provide
theoretical and empirical evidence that for depth-2 matrix factorization,
gradient flow with infinitesimal initialization is mathematically equivalent to
a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under
some reasonable assumptions. This generalizes the rank minimization view from
previous works to a much broader setting and enables us to construct
counter-examples to refute the conjecture from Gunasekar et al. (2017). We also
extend the results to the case where depth $\ge 3$, and we show that the
benefit of being deeper is that the above convergence has a much weaker
dependence on the initialization magnitude, so that this rank minimization is
more likely to take effect for initializations of practical scale.
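The abstract characterizes gradient flow from infinitesimal initialization as a greedy procedure that fits one rank at a time. Below is a minimal sketch of that greedy low-rank idea for the symmetric depth-2 case W = U U^T, assuming a toy matrix-completion loss; the names (glrl, grad_L, loss_L), the step size, the perturbation size eps, and the stopping rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def glrl(grad_L, loss_L, d, max_rank, eps=1e-3, lr=0.05, inner_steps=2000):
    """Greedily grow the factor U of W = U U^T one column at a time.

    grad_L(W): gradient of the loss at the symmetric d x d matrix W.
    loss_L(W): loss value (used only for monitoring).
    """
    U = np.zeros((d, 0))                          # start from the zero solution
    for r in range(1, max_rank + 1):
        # New search direction: top eigenvector of -grad_L at the current point.
        evals, evecs = np.linalg.eigh(-grad_L(U @ U.T))
        if evals[-1] <= 1e-8:                     # no descent direction left
            break
        v = evecs[:, [-1]]
        U = np.hstack([U, eps * v])               # rank-r warm start
        for _ in range(inner_steps):              # plain gradient descent on U
            U -= lr * 2.0 * grad_L(U @ U.T) @ U
        print(f"rank {r}: loss = {loss_L(U @ U.T):.4e}")
    return U

# Toy symmetric matrix-completion instance: observe a random subset of the
# entries of a rank-2 ground-truth matrix (purely illustrative).
rng = np.random.default_rng(0)
d = 10
G = rng.standard_normal((d, 2))
M = G @ G.T / d
mask = rng.random((d, d)) < 0.5
mask = mask | mask.T
loss_L = lambda W: 0.5 * np.sum(mask * (W - M) ** 2)
grad_L = lambda W: mask * (W - M)

U_hat = glrl(grad_L, loss_L, d, max_rank=4)
```

Everything here beyond the greedy one-rank-at-a-time structure and the eigenvector-based warm start should be read as an illustrative choice rather than the paper's definition.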
Related papers
- Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity [11.412228884390784]
We study the problem of reconstructing a low-rank matrix from a few measurements.
We show that factorized gradient descent with spectral initialization converges to the ground truth with a number of samples that breaks the quadratic dependence on the rank.
arXiv Detail & Related papers (2024-08-20T14:09:28Z) - Deep linear networks for regression are implicitly regularized towards flat minima [4.806579822134391]
Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one.
We show a lower bound on the sharpness of minimizers, which grows linearly with depth.
We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound.
arXiv Detail & Related papers (2024-05-22T08:58:51Z) - Riemannian stochastic optimization methods avoid strict saddle points [68.80251170757647]
We show that the methods under study avoid strict saddle points / submanifolds with probability 1.
This result provides an important sanity check as it shows that, almost always, the limit state of an algorithm can only be a local minimizer.
arXiv Detail & Related papers (2023-11-04T11:12:24Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Lojasiewicz inequality (NL).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Noisy Gradient Descent Converges to Flat Minima for Nonconvex Matrix
Factorization [36.182992409810446]
This paper investigates the role of noise in nonconvex optimization problems.
We show that noisy gradient descent converges to flat minima, with an implicit bias that is determined by the injected noise.
arXiv Detail & Related papers (2021-02-24T17:50:17Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we study gradient-free (zeroth-order) minimax optimization in the nonconvex-strongly-concave setting.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled
Gradient Descent [34.0533596121548]
Low-rank matrix estimation is a canonical problem that finds numerous applications in signal processing, machine learning and imaging science.
We show that ScaledGD achieves a convergence rate independent of the condition number of the low-rank matrix (a minimal sketch of the scaled update appears after this list).
Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices.
arXiv Detail & Related papers (2020-05-18T17:17:16Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for nonconvex-nonconcave minimax problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
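The ScaledGD entry above names a scaled (preconditioned) gradient update. The following is a minimal, hedged sketch of that update on a toy, fully observed low-rank least-squares problem; the loss, the perturbed balanced initialization, the step size eta = 0.5, and the iteration budget are assumptions made for this example. Only the preconditioning of each factor's gradient by (R^T R)^{-1} and (L^T L)^{-1} reflects the scaled-update idea.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 30, 3
# Ill-conditioned rank-3 ground truth (condition number 100).
Q1 = np.linalg.qr(rng.standard_normal((d, r)))[0]
Q2 = np.linalg.qr(rng.standard_normal((d, r)))[0]
sigma = np.array([100.0, 10.0, 1.0])
M = Q1 @ np.diag(sigma) @ Q2.T

# Balanced factors of M, slightly perturbed so that L R^T != M at the start.
L = Q1 * np.sqrt(sigma) + 0.01 * rng.standard_normal((d, r))
R = Q2 * np.sqrt(sigma) + 0.01 * rng.standard_normal((d, r))

eta = 0.5
for t in range(100):
    E = L @ R.T - M                                     # residual
    L_new = L - eta * E @ R @ np.linalg.inv(R.T @ R)    # scaled step for L
    R_new = R - eta * E.T @ L @ np.linalg.inv(L.T @ L)  # scaled step for R
    L, R = L_new, R_new

print("relative error:", np.linalg.norm(L @ R.T - M) / np.linalg.norm(M))
```

In this toy run the contraction per step depends on eta rather than on the spread of the singular values, which is the behavior the condition-number-independent rate in the entry refers to; the exact problem setting and guarantees are those of the cited paper, not of this sketch.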