Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix
Factorization
- URL: http://arxiv.org/abs/2106.14289v1
- Date: Sun, 27 Jun 2021 17:25:24 GMT
- Title: Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix
Factorization
- Authors: Tian Ye and Simon S. Du
- Abstract summary: We study the asymmetric low-rank factorization problem $\min_{\mathbf{U} \in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}} \frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2$, which is non-convex and non-smooth due to the unbalancedness of $\mathbf{U}$ and $\mathbf{V}$.
- Score: 49.090785356633695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the asymmetric low-rank factorization problem: \[\min_{\mathbf{U}
\in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}}
\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top -\mathbf{\Sigma}\|_F^2\] where
$\mathbf{\Sigma}$ is a given matrix of size $m \times n$ and rank $d$. This is
a canonical problem that admits two difficulties in optimization: 1)
non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and
$\mathbf{V}$). This is also a prototype for more complex problems such as
asymmetric matrix sensing and matrix completion. Despite being non-convex and
non-smooth, it has been observed empirically that the randomly initialized
gradient descent algorithm can solve this problem in polynomial time. Existing
theories to explain this phenomenon all require artificial modifications of the
algorithm, such as adding noise in each iteration or adding a balancing
regularizer to balance $\mathbf{U}$ and $\mathbf{V}$.
This paper presents the first proof that shows randomly initialized gradient
descent converges to a global minimum of the asymmetric low-rank factorization
problem with a polynomial rate. For the proof, we develop 1) a new
symmetrization technique to capture the magnitudes of the symmetry and
asymmetry, and 2) a quantitative perturbation analysis to approximate matrix
derivatives. We believe both are useful for other related non-convex problems.
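To make the setup concrete, the sketch below runs randomly initialized gradient descent on $\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2$ in NumPy. It only illustrates the problem and the plain algorithm discussed above; the dimensions, initialization scale, step size, and iteration count are arbitrary choices, not quantities from the paper's analysis.

```python
import numpy as np

# Minimal sketch: randomly initialized gradient descent on
# f(U, V) = 0.5 * ||U V^T - Sigma||_F^2.
# Dimensions, initialization scale, step size, and iteration count are
# illustrative choices, not the quantities from the paper's analysis.
rng = np.random.default_rng(0)
m, n, d = 30, 20, 5
Sigma = rng.standard_normal((m, d)) @ rng.standard_normal((d, n))  # rank-d target

U = 0.01 * rng.standard_normal((m, d))   # small random initialization,
V = 0.01 * rng.standard_normal((n, d))   # no balancing regularizer, no added noise

eta = 0.01                               # fixed step size
for t in range(10000):
    R = U @ V.T - Sigma                            # residual
    U, V = U - eta * R @ V, V - eta * R.T @ U      # simultaneous gradient step

print("final loss:", 0.5 * np.linalg.norm(U @ V.T - Sigma) ** 2)
```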
Related papers
- The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
We show that this problem has randomized communication complexity $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$.
As an application, we obtain an $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$ space lower bound for any streaming algorithm with $k$ passes.
arXiv Detail & Related papers (2024-10-26T06:21:42Z)
- Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks [46.04785603483612]
We prove that Nesterov's accelerated gradient attains a complexity of $O(\kappa\log\frac{1}{\epsilon})$.
In particular, we prove that NAG can also attain an accelerated linear convergence rate.
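For intuition, here is a minimal NumPy sketch of Nesterov-style momentum applied to the rectangular factorization objective $\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{A}\|_F^2$. It is not the algorithm or parameter schedule from that paper: the constant momentum $\beta$, the step size, and the problem sizes are illustrative assumptions.

```python
import numpy as np

# Sketch: Nesterov-style momentum on f(U, V) = 0.5 * ||U V^T - A||_F^2.
# The constant momentum beta, step size, and problem sizes are illustrative
# assumptions, not the parameter schedule analyzed in the paper.
rng = np.random.default_rng(0)
m, n, d = 30, 20, 4
A = rng.standard_normal((m, d)) @ rng.standard_normal((d, n))   # rank-d target

U = 0.01 * rng.standard_normal((m, d))
V = 0.01 * rng.standard_normal((n, d))
U_prev, V_prev = U.copy(), V.copy()

eta, beta = 0.01, 0.9
for t in range(3000):
    U_la = U + beta * (U - U_prev)        # look-ahead (extrapolated) point
    V_la = V + beta * (V - V_prev)
    R = U_la @ V_la.T - A
    U_prev, V_prev = U, V
    U = U_la - eta * R @ V_la             # gradient step at the look-ahead point
    V = V_la - eta * R.T @ U_la

print("relative error:", np.linalg.norm(U @ V.T - A) / np.linalg.norm(A))
```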
arXiv Detail & Related papers (2024-10-12T20:33:37Z)
- In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting [21.002519159190538]
We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients.
We obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$.
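The sketch below only illustrates the data layout described above: one global factor $\mathbf{V}$ shared by all clients and a local factor per client. The alternating update scheme, the client data, and the step size are illustrative assumptions, not the distributed algorithm analyzed in that paper.

```python
import numpy as np

# Layout sketch: a global V in R^{d x r} shared by all clients and a local
# U_i in R^{n_i x r} per client. The alternating updates below are an
# illustrative scheme, not the distributed algorithm analyzed in the paper.
rng = np.random.default_rng(1)
d, r = 10, 3
n_i = [50, 60, 40, 55]                               # rows held by each of the 4 clients
S = [rng.standard_normal((ni, d)) for ni in n_i]     # per-client data matrices

V = rng.standard_normal((d, r))                      # global factor (server)
U = [rng.standard_normal((ni, r)) for ni in n_i]     # local factors (clients)

eta = 1e-3
for it in range(300):
    # Each client refits its local U_i by least squares with V fixed.
    U = [np.linalg.lstsq(V, Si.T, rcond=None)[0].T for Si in S]
    # The server takes a gradient step on V using aggregated client contributions.
    grad_V = sum((Ui @ V.T - Si).T @ Ui for Si, Ui in zip(S, U))
    V -= eta * grad_V

err = sum(np.linalg.norm(Ui @ V.T - Si) ** 2 for Si, Ui in zip(S, U))
print("total squared reconstruction error:", err)
```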
arXiv Detail & Related papers (2024-09-13T12:28:42Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
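For reference, here is a minimal NumPy sketch of a generic multi-head attention forward pass with $m$ heads. The projection shapes, scaling, and output projection follow the standard transformer formulation and are not necessarily the exact parameterization studied in that paper.

```python
import numpy as np

# Sketch: forward pass of a generic multi-head attention layer with m heads.
# Shapes and scaling follow the standard transformer formulation; this is not
# necessarily the exact parameterization studied in the paper.
def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: per-head projection lists; Wo: output projection."""
    heads = []
    for Qw, Kw, Vw in zip(Wq, Wk, Wv):
        Q, K, V = X @ Qw, X @ Kw, X @ Vw               # (seq_len, d_head) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot-product attention
        heads.append(softmax(scores, axis=-1) @ V)     # (seq_len, d_head)
    return np.concatenate(heads, axis=-1) @ Wo         # combine all m heads

rng = np.random.default_rng(0)
m, d_model, d_head, seq_len = 2, 8, 4, 5               # illustrative sizes
X = rng.standard_normal((seq_len, d_model))
Wq = [rng.standard_normal((d_model, d_head)) for _ in range(m)]
Wk = [rng.standard_normal((d_model, d_head)) for _ in range(m)]
Wv = [rng.standard_normal((d_model, d_head)) for _ in range(m)]
Wo = rng.standard_normal((m * d_head, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)   # (5, 8)
```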
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization [46.55524654398093]
We show how over-parameterization changes the convergence behavior of gradient descent.
The goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements.
We propose a novel method that only modifies one step of GD and obtains a convergence rate independent of the initialization scale $\alpha$.
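The sketch below only sets up the matrix sensing problem with plain factored gradient descent from a small initialization; it illustrates the setting, not the modified GD step proposed in that paper, and the measurement count, over-parameterized rank, initialization scale $\alpha$, and step size are illustrative assumptions.

```python
import numpy as np

# Sketch of the matrix sensing setting with plain factored gradient descent:
# recover a rank-r PSD matrix M* from p Gaussian linear measurements, using an
# over-parameterized factor X with k > r columns and small initialization alpha.
# Baseline setting only, not the paper's modified GD step; constants are illustrative.
rng = np.random.default_rng(0)
n, r, k, p = 10, 2, 4, 200
U_star = rng.standard_normal((n, r))
M_star = U_star @ U_star.T                        # unknown rank-r ground truth

A = rng.standard_normal((p, n, n))
A = (A + A.transpose(0, 2, 1)) / 2                # symmetrized measurement matrices
y = np.einsum('pij,ij->p', A, M_star)             # y_i = <A_i, M*>

alpha = 1e-3                                      # small initialization scale
X = alpha * rng.standard_normal((n, k))

eta = 0.005
for t in range(5000):
    resid = np.einsum('pij,ij->p', A, X @ X.T) - y
    grad = (2.0 / p) * np.einsum('p,pij,jk->ik', resid, A, X)
    X -= eta * grad

# With k > r and small alpha, plain GD makes progress but converges slowly --
# the slowdown phenomenon this paper studies.
print("relative error:", np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star))
```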
arXiv Detail & Related papers (2023-10-03T03:34:22Z)
- Convergence of Alternating Gradient Descent for Matrix Factorization [5.439020425819001]
We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective.
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, mild over-parameterization suffices for the constant $C$ in the iteration complexity $T$ to be an absolute constant.
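A minimal NumPy sketch of AGD with a fixed step size on the asymmetric factorization objective, alternating the two factor updates instead of taking them simultaneously; the dimensions, initialization, and step size are illustrative choices rather than the specific initialization analyzed in that paper.

```python
import numpy as np

# Sketch: alternating gradient descent (AGD) with a fixed step size on
# f(U, V) = 0.5 * ||U V^T - A||_F^2, updating U and then V in turn.
# Sizes, initialization, and step size are illustrative assumptions.
rng = np.random.default_rng(0)
m, n, r, d = 25, 15, 3, 6                 # factor rank d >= true rank r
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r target

U = rng.standard_normal((m, d)) / np.sqrt(d)
V = rng.standard_normal((n, d)) / np.sqrt(d)

eta = 0.01                                # fixed step size
for t in range(5000):
    U -= eta * (U @ V.T - A) @ V          # update U with V held fixed
    V -= eta * (U @ V.T - A).T @ U        # then update V using the new U

print("relative error:", np.linalg.norm(U @ V.T - A) / np.linalg.norm(A))
```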
arXiv Detail & Related papers (2023-05-11T16:07:47Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{opt} + \epsilon$ with high probability, where $\mathrm{opt}$ is the best achievable squared loss.
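A minimal NumPy sketch of gradient descent on the empirical squared loss of a single neuron $\mathbf{x} \mapsto \sigma(\mathbf{w}\cdot\mathbf{x})$, using ReLU as an illustrative monotone activation; the Gaussian data, the simple label-corruption model, and the step size are assumptions for illustration rather than the paper's exact adversarial-noise setting.

```python
import numpy as np

# Sketch: gradient descent on the empirical squared loss of a single neuron
# x -> sigma(w . x), with ReLU as an illustrative monotone activation.
# Gaussian data, the simple corruption model, and the step size are
# illustrative assumptions, not the paper's exact adversarial-noise setting.
rng = np.random.default_rng(0)
d, n = 10, 2000

def relu(z):
    return np.maximum(z, 0.0)

w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = relu(X @ w_star)
corrupt = rng.random(n) < 0.05                    # corrupt a small fraction of labels
y[corrupt] += 5.0 * rng.standard_normal(corrupt.sum())

w = 0.1 * rng.standard_normal(d)
eta = 0.1
for t in range(1000):
    pred = relu(X @ w)
    # d/dw of (1/n) * sum (relu(w.x) - y)^2; relu'(z) = 1{z > 0}.
    grad = (2.0 / n) * X.T @ ((pred - y) * (X @ w > 0))
    w -= eta * grad

print("parameter error:", np.linalg.norm(w - w_star) / np.linalg.norm(w_star))
```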
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Multiplicative updates for symmetric-cone factorizations [0.0]
We introduce and analyze the symmetric-cone multiplicative update (SCMU) algorithm for computing cone factorizations.
We show that the squared loss objective is non-increasing along the trajectories of the SCMU algorithm.
arXiv Detail & Related papers (2021-08-02T09:23:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.