Noisy Gradient Descent Converges to Flat Minima for Nonconvex Matrix
Factorization
- URL: http://arxiv.org/abs/2102.12430v1
- Date: Wed, 24 Feb 2021 17:50:17 GMT
- Title: Noisy Gradient Descent Converges to Flat Minima for Nonconvex Matrix
Factorization
- Authors: Tianyi Liu, Yan Li, Song Wei, Enlu Zhou and Tuo Zhao
- Abstract summary: This paper investigates the importance of noise in nonconvex optimization problems.
We show that a perturbed form of gradient descent with arbitrary initialization converges to a global optimum that is uniquely determined by the injected noise.
- Score: 36.182992409810446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Abundant empirical evidence has corroborated the importance of noise in
nonconvex optimization problems. The theory behind such empirical observations,
however, is still largely unknown. This paper studies this fundamental problem
by investigating the nonconvex rectangular matrix factorization problem,
which has infinitely many global minima due to rotation and scaling invariance.
Hence, gradient descent (GD) can converge to any optimum, depending on the
initialization. In contrast, we show that a perturbed form of GD with an
arbitrary initialization converges to a global optimum that is uniquely
determined by the injected noise. Our result implies that the noise imposes an
implicit bias towards certain optima. Numerical experiments are provided to
support our theory.
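As a concrete illustration of the setup, here is a minimal sketch of perturbed gradient descent for rectangular matrix factorization, minimizing $\frac{1}{2}\|UV^\top - M\|_F^2$ with Gaussian noise injected into every update. The step size, noise scale, and constant noise schedule below are illustrative choices of ours, not the exact scheme analyzed in the paper.

```python
import numpy as np

# Perturbed gradient descent for min over (U, V) of 0.5 * ||U V^T - M||_F^2.
# `eta` (step size) and `sigma` (noise scale) are illustrative, not the
# paper's schedule.
def perturbed_gd(M, r, eta=0.01, sigma=1e-3, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    for _ in range(steps):
        R = U @ V.T - M          # residual
        gU = R @ V               # gradient with respect to U
        gV = R.T @ U             # gradient with respect to V
        U = U - eta * (gU + sigma * rng.standard_normal(U.shape))
        V = V - eta * (gV + sigma * rng.standard_normal(V.shape))
    return U, V

# Example: factor a rank-1 4x3 matrix with overparameterized rank r = 2.
M = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))
U, V = perturbed_gd(M, r=2)
print(np.linalg.norm(U @ V.T - M))  # small, up to the injected noise level
```

Because the loss is invariant under $(U, V) \mapsto (UA, VA^{-\top})$ for any invertible $A$, plain GD can stop anywhere on this manifold of global minima, whereas the abstract's claim is that the injected noise singles out one optimum regardless of the initialization.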
Related papers
- On the phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance [5.058205542605482]
We make progress towards the understanding of matrix denoising when the hidden signal is a factored matrix $XX^\intercal$ that is not rotationally invariant.
We argue that it is only beyond the transition that factorisation, i.e., estimating $X$ itself, becomes possible up to sign and permutation universality.
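For context, the extensive-rank denoising setup referred to here is typically of the form
$$ Y = \frac{1}{\sqrt{N}}\, X X^\intercal + Z, \qquad X \in \mathbb{R}^{N \times M},\quad M = \alpha N, $$
where $Z$ is symmetric Gaussian noise; the scaling and the notation $\alpha$ for the aspect ratio are our assumptions, stated only to fix ideas, since the rank of $XX^\intercal$ grows with $N$ in this regime.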
arXiv Detail & Related papers (2024-11-04T10:50:37Z) - Curvature-Independent Last-Iterate Convergence for Games on Riemannian
Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
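To fix notation (ours, not necessarily the paper's), a generic Riemannian gradient update is
$$ x_{t+1} = \exp_{x_t}\!\big(-\eta \operatorname{grad} f(x_t)\big), $$
where $\exp_x$ is the exponential map and $\operatorname{grad} f$ the Riemannian gradient; the entry's point is that the step size $\eta$ can be chosen without knowing curvature bounds of the manifold while still giving a linear last-iterate rate.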
arXiv Detail & Related papers (2023-06-29T01:20:44Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of minimum-trace-of-Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
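Stated loosely as a formula, with our notation for a depth-$L$ linear network with end-to-end matrix $W_L \cdots W_1$ fit to RIP measurements, the claim is
$$ \min_{\text{interpolating } (W_1, \dots, W_L)} \operatorname{tr}\big(\nabla^2 \mathcal{L}\big) \;\approx\; \min_{\text{interpolating } (W_1, \dots, W_L)} \big\| W_L \cdots W_1 \big\|_{S_1}, $$
where $\|\cdot\|_{S_1}$ is the Schatten 1-norm (nuclear norm) and measurement-dependent constants are suppressed.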
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - High-Probability Bounds for Stochastic Optimization and Variational
Inequalities: the Case of Unbounded Variance [59.211456992422136]
We propose algorithms with high-probability convergence results under less restrictive assumptions.
These results justify using the considered methods to solve problems that do not fit standard functional classes in optimization.
arXiv Detail & Related papers (2023-02-02T10:37:23Z) - Robust Matrix Completion with Heavy-tailed Noise [0.5837881923712392]
This paper studies low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise.
We adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors.
We prove that under merely a second moment condition on the error, the Euclidean error falls geometrically fast until achieving a minimax-optimal statistical estimation error.
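The Huber loss underlying this entry is the standard one; with robustification parameter $\tau > 0$,
$$ \ell_\tau(x) = \begin{cases} x^2/2, & |x| \le \tau, \\ \tau |x| - \tau^2/2, & |x| > \tau, \end{cases} $$
so residuals are penalized quadratically when small and only linearly when large, which is what blunts heavy-tailed errors; the data-driven choice of $\tau$ is the paper's contribution and is not reproduced here.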
arXiv Detail & Related papers (2022-06-09T04:48:48Z) - Minimax Optimization: The Case of Convex-Submodular [50.03984152441271]
Minimax problems extend beyond the continuous domain to mixed continuous-discrete domains or even fully discrete domains.
We introduce the class of convex-submodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable.
Our proposed algorithms are iterative and combine tools from both discrete and continuous optimization.
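The discrete half of "convex-submodular" is the standard diminishing-returns property: a set function $g : 2^V \to \mathbb{R}$ is submodular if
$$ g(S) + g(T) \;\ge\; g(S \cup T) + g(S \cap T) \qquad \text{for all } S, T \subseteq V, $$
and here the objective $f(x, S)$ is convex in the continuous variable $x$ for every fixed $S$ and submodular in $S$ for every fixed $x$ (notation ours).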
arXiv Detail & Related papers (2021-11-01T21:06:35Z) - Optimizing Information-theoretical Generalization Bounds via Anisotropic
Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
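In SGLD-style notation (ours, to make the statement concrete), the update with injected noise covariance $\Sigma$ reads
$$ \theta_{t+1} = \theta_t - \eta\, \nabla \widehat{L}(\theta_t) + \sqrt{2\eta}\, \Sigma^{1/2} \xi_t, \qquad \xi_t \sim \mathcal{N}(0, I), $$
and the stated result is that, subject to the low-empirical-risk constraint, the bound-optimal choice satisfies $\Sigma \propto \operatorname{Cov}(\nabla \ell)^{1/2}$, the square root of the expected gradient covariance.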
arXiv Detail & Related papers (2021-10-26T15:02:27Z) - On the Role of Entropy-based Loss for Learning Causal Structures with
Continuous Optimization [27.613220411996025]
A method with a non-combinatorial directed acyclicity constraint, called NOTEARS, formulates causal structure learning as a continuous optimization problem with a least-squares loss.
We show that violating the Gaussian noise assumption hinders the identification of causal directions.
We propose a more general entropy-based loss that is theoretically consistent with the likelihood score under any noise distribution.
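For reference, the NOTEARS program this entry modifies replaces the combinatorial acyclicity search with a smooth equality constraint; with data matrix $X \in \mathbb{R}^{n \times d}$ and weighted adjacency matrix $W \in \mathbb{R}^{d \times d}$, it reads
$$ \min_{W} \ \frac{1}{2n} \| X - X W \|_F^2 \quad \text{subject to} \quad h(W) := \operatorname{tr}\big(e^{W \circ W}\big) - d = 0, $$
where $\circ$ is the elementwise product and $h(W) = 0$ holds exactly when $W$ encodes a DAG (the usual $\ell_1$ sparsity penalty is omitted for brevity); this entry's proposal swaps the least-squares term for an entropy-based loss while keeping the constraint.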
arXiv Detail & Related papers (2021-06-05T08:29:51Z) - Towards Resolving the Implicit Bias of Gradient Descent for Matrix
Factorization: Greedy Low-Rank Learning [19.82453283089643]
Matrix factorization is a simple and natural test-bed for investigating the implicit regularization of gradient descent.
We show that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple rank minimization algorithm.
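Here "gradient flow with infinitesimal initialization" means, in our notation for the depth-2 loss $\mathcal{L}(U, V) = \frac{1}{2}\|UV^\top - M\|_F^2$, the continuous-time dynamics
$$ \dot U = -\nabla_U \mathcal{L}(U, V), \qquad \dot V = -\nabla_V \mathcal{L}(U, V), \qquad \|U(0)\|, \|V(0)\| = O(\alpha),\ \alpha \to 0, $$
which in the small-initialization limit recovers components greedily, one rank at a time, matching the greedy low-rank learning procedure of the title.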
arXiv Detail & Related papers (2020-12-17T18:57:01Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
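A convenient reading of this claim is through the diffusion approximation of noisy gradient descent (notation ours):
$$ d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta}\, \Sigma(\theta_t)^{1/2}\, dB_t, $$
where mini-batch or label-perturbation noise yields a parameter-dependent covariance $\Sigma(\theta)$ that biases the dynamics toward minima with small $\operatorname{tr} \Sigma(\theta)$, while spherical Gaussian noise has constant covariance $\sigma^2 I$ and induces no such preference.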
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.