Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix
Factorization
- URL: http://arxiv.org/abs/2106.14289v1
- Date: Sun, 27 Jun 2021 17:25:24 GMT
- Title: Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix
Factorization
- Authors: Tian Ye and Simon S. Du
- Abstract summary: We study the asymmetric low-rank factorization problem $\min_{\mathbf{U} \in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}} \frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2$, which is non-convex and non-smooth due to the unbalancedness of $\mathbf{U}$ and $\mathbf{V}$.
- Score: 49.090785356633695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the asymmetric low-rank factorization problem: \[\min_{\mathbf{U}
\in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}}
\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top -\mathbf{\Sigma}\|_F^2\] where
$\mathbf{\Sigma}$ is a given matrix of size $m \times n$ and rank $d$. This is
a canonical problem that admits two difficulties in optimization: 1)
non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and
$\mathbf{V}$). This is also a prototype for more complex problems such as
asymmetric matrix sensing and matrix completion. Despite being non-convex and
non-smooth, it has been observed empirically that the randomly initialized
gradient descent algorithm can solve this problem in polynomial time. Existing
theories to explain this phenomenon all require artificial modifications of the
algorithm, such as adding noise in each iteration or adding a balancing
regularizer to balance $\mathbf{U}$ and $\mathbf{V}$.
This paper presents the first proof that shows randomly initialized gradient
descent converges to a global minimum of the asymmetric low-rank factorization
problem with a polynomial rate. For the proof, we develop 1) a new
symmetrization technique to capture the magnitudes of the symmetry and
asymmetry, and 2) a quantitative perturbation analysis to approximate matrix
derivatives. We believe both are useful for other related non-convex problems.
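To make the setup concrete, the sketch below runs randomly initialized gradient descent on $\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2$ in NumPy. It only illustrates the problem and the plain algorithm discussed above; the dimensions, initialization scale, step size, and iteration count are arbitrary choices, not quantities from the paper's analysis.

```python
import numpy as np

# Minimal sketch: randomly initialized gradient descent on
# f(U, V) = 0.5 * ||U V^T - Sigma||_F^2.
# Dimensions, initialization scale, step size, and iteration count are
# illustrative choices, not the quantities from the paper's analysis.
rng = np.random.default_rng(0)
m, n, d = 30, 20, 5
Sigma = rng.standard_normal((m, d)) @ rng.standard_normal((d, n))  # rank-d target

U = 0.01 * rng.standard_normal((m, d))   # small random initialization,
V = 0.01 * rng.standard_normal((n, d))   # no balancing regularizer, no added noise

eta = 0.01                               # fixed step size
for t in range(10000):
    R = U @ V.T - Sigma                            # residual
    U, V = U - eta * R @ V, V - eta * R.T @ U      # simultaneous gradient step

print("final loss:", 0.5 * np.linalg.norm(U @ V.T - Sigma) ** 2)
```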
Related papers
- The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
We show that this problem has randomized communication complexity $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$.
As an application, we obtain an $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$ space lower bound for any streaming algorithm with $k$ passes.
arXiv Detail & Related papers (2024-10-26T06:21:42Z)
- Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks [46.04785603483612]
We prove that Nesterov's accelerated gradient attains a complexity of $O(\kappa\log\frac{1}{\epsilon})$.
In particular, we prove that NAG can also attain an accelerated linear convergence rate.
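For intuition, here is a minimal NumPy sketch of Nesterov-style momentum applied to the rectangular factorization objective $\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{A}\|_F^2$. It is not the algorithm or parameter schedule from that paper: the constant momentum $\beta$, the step size, and the problem sizes are illustrative assumptions.

```python
import numpy as np

# Sketch: Nesterov-style momentum on f(U, V) = 0.5 * ||U V^T - A||_F^2.
# The constant momentum beta, step size, and problem sizes are illustrative
# assumptions, not the parameter schedule analyzed in the paper.
rng = np.random.default_rng(0)
m, n, d = 30, 20, 4
A = rng.standard_normal((m, d)) @ rng.standard_normal((d, n))   # rank-d target

U = 0.01 * rng.standard_normal((m, d))
V = 0.01 * rng.standard_normal((n, d))
U_prev, V_prev = U.copy(), V.copy()

eta, beta = 0.01, 0.9
for t in range(3000):
    U_la = U + beta * (U - U_prev)        # look-ahead (extrapolated) point
    V_la = V + beta * (V - V_prev)
    R = U_la @ V_la.T - A
    U_prev, V_prev = U, V
    U = U_la - eta * R @ V_la             # gradient step at the look-ahead point
    V = V_la - eta * R.T @ U_la

print("relative error:", np.linalg.norm(U @ V.T - A) / np.linalg.norm(A))
```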
arXiv Detail & Related papers (2024-10-12T20:33:37Z)
- In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting [21.002519159190538]
We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients.
We obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$.
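The sketch below only illustrates the data layout described above: one global factor $\mathbf{V}$ shared by all clients and a local factor per client. The alternating update scheme, the client data, and the step size are illustrative assumptions, not the distributed algorithm analyzed in that paper.

```python
import numpy as np

# Layout sketch: a global V in R^{d x r} shared by all clients and a local
# U_i in R^{n_i x r} per client. The alternating updates below are an
# illustrative scheme, not the distributed algorithm analyzed in the paper.
rng = np.random.default_rng(1)
d, r = 10, 3
n_i = [50, 60, 40, 55]                               # rows held by each of the 4 clients
S = [rng.standard_normal((ni, d)) for ni in n_i]     # per-client data matrices

V = rng.standard_normal((d, r))                      # global factor (server)
U = [rng.standard_normal((ni, r)) for ni in n_i]     # local factors (clients)

eta = 1e-3
for it in range(300):
    # Each client refits its local U_i by least squares with V fixed.
    U = [np.linalg.lstsq(V, Si.T, rcond=None)[0].T for Si in S]
    # The server takes a gradient step on V using aggregated client contributions.
    grad_V = sum((Ui @ V.T - Si).T @ Ui for Si, Ui in zip(S, U))
    V -= eta * grad_V

err = sum(np.linalg.norm(Ui @ V.T - Si) ** 2 for Si, Ui in zip(S, U))
print("total squared reconstruction error:", err)
```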
arXiv Detail & Related papers (2024-09-13T12:28:42Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
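For reference, here is a minimal NumPy sketch of a generic multi-head attention forward pass with $m$ heads. The projection shapes, scaling, and output projection follow the standard transformer formulation and are not necessarily the exact parameterization studied in that paper.

```python
import numpy as np

# Sketch: forward pass of a generic multi-head attention layer with m heads.
# Shapes and scaling follow the standard transformer formulation; this is not
# necessarily the exact parameterization studied in the paper.
def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: per-head projection lists; Wo: output projection."""
    heads = []
    for Qw, Kw, Vw in zip(Wq, Wk, Wv):
        Q, K, V = X @ Qw, X @ Kw, X @ Vw               # (seq_len, d_head) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot-product attention
        heads.append(softmax(scores, axis=-1) @ V)     # (seq_len, d_head)
    return np.concatenate(heads, axis=-1) @ Wo         # combine all m heads

rng = np.random.default_rng(0)
m, d_model, d_head, seq_len = 2, 8, 4, 5               # illustrative sizes
X = rng.standard_normal((seq_len, d_model))
Wq = [rng.standard_normal((d_model, d_head)) for _ in range(m)]
Wk = [rng.standard_normal((d_model, d_head)) for _ in range(m)]
Wv = [rng.standard_normal((d_model, d_head)) for _ in range(m)]
Wo = rng.standard_normal((m * d_head, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)   # (5, 8)
```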
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization [46.55524654398093]
We show how over-parameterization changes the convergence behavior of gradient descent.
The goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements.
We propose a novel method that only modifies one step of GD and obtains a convergence rate independent of the initialization scale $\alpha$.
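The sketch below only sets up the matrix sensing problem with plain factored gradient descent from a small initialization; it illustrates the setting, not the modified GD step proposed in that paper, and the measurement count, over-parameterized rank, initialization scale $\alpha$, and step size are illustrative assumptions.

```python
import numpy as np

# Sketch of the matrix sensing setting with plain factored gradient descent:
# recover a rank-r PSD matrix M* from p Gaussian linear measurements, using an
# over-parameterized factor X with k > r columns and small initialization alpha.
# Baseline setting only, not the paper's modified GD step; constants are illustrative.
rng = np.random.default_rng(0)
n, r, k, p = 10, 2, 4, 200
U_star = rng.standard_normal((n, r))
M_star = U_star @ U_star.T                        # unknown rank-r ground truth

A = rng.standard_normal((p, n, n))
A = (A + A.transpose(0, 2, 1)) / 2                # symmetrized measurement matrices
y = np.einsum('pij,ij->p', A, M_star)             # y_i = <A_i, M*>

alpha = 1e-3                                      # small initialization scale
X = alpha * rng.standard_normal((n, k))

eta = 0.005
for t in range(5000):
    resid = np.einsum('pij,ij->p', A, X @ X.T) - y
    grad = (2.0 / p) * np.einsum('p,pij,jk->ik', resid, A, X)
    X -= eta * grad

# With k > r and small alpha, plain GD makes progress but converges slowly --
# the slowdown phenomenon this paper studies.
print("relative error:", np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star))
```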
arXiv Detail & Related papers (2023-10-03T03:34:22Z)
- Convergence of Alternating Gradient Descent for Matrix Factorization [5.439020425819001]
We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective.
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, mild over-parameterization suffices for the constant $C$ in the iteration complexity $T$ to be an absolute constant.
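A minimal NumPy sketch of AGD with a fixed step size on the asymmetric factorization objective, alternating the two factor updates instead of taking them simultaneously; the dimensions, initialization, and step size are illustrative choices rather than the specific initialization analyzed in that paper.

```python
import numpy as np

# Sketch: alternating gradient descent (AGD) with a fixed step size on
# f(U, V) = 0.5 * ||U V^T - A||_F^2, updating U and then V in turn.
# Sizes, initialization, and step size are illustrative assumptions.
rng = np.random.default_rng(0)
m, n, r, d = 25, 15, 3, 6                 # factor rank d >= true rank r
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r target

U = rng.standard_normal((m, d)) / np.sqrt(d)
V = rng.standard_normal((n, d)) / np.sqrt(d)

eta = 0.01                                # fixed step size
for t in range(5000):
    U -= eta * (U @ V.T - A) @ V          # update U with V held fixed
    V -= eta * (U @ V.T - A).T @ U        # then update V using the new U

print("relative error:", np.linalg.norm(U @ V.T - A) / np.linalg.norm(A))
```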
arXiv Detail & Related papers (2023-05-11T16:07:47Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{opt} + \epsilon$ with high probability, where $\mathrm{opt}$ is the best achievable squared loss.
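A minimal NumPy sketch of gradient descent on the empirical squared loss of a single neuron $\mathbf{x} \mapsto \sigma(\mathbf{w}\cdot\mathbf{x})$, using ReLU as an illustrative monotone activation; the Gaussian data, the simple label-corruption model, and the step size are assumptions for illustration rather than the paper's exact adversarial-noise setting.

```python
import numpy as np

# Sketch: gradient descent on the empirical squared loss of a single neuron
# x -> sigma(w . x), with ReLU as an illustrative monotone activation.
# Gaussian data, the simple corruption model, and the step size are
# illustrative assumptions, not the paper's exact adversarial-noise setting.
rng = np.random.default_rng(0)
d, n = 10, 2000

def relu(z):
    return np.maximum(z, 0.0)

w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = relu(X @ w_star)
corrupt = rng.random(n) < 0.05                    # corrupt a small fraction of labels
y[corrupt] += 5.0 * rng.standard_normal(corrupt.sum())

w = 0.1 * rng.standard_normal(d)
eta = 0.1
for t in range(1000):
    pred = relu(X @ w)
    # d/dw of (1/n) * sum (relu(w.x) - y)^2; relu'(z) = 1{z > 0}.
    grad = (2.0 / n) * X.T @ ((pred - y) * (X @ w > 0))
    w -= eta * grad

print("parameter error:", np.linalg.norm(w - w_star) / np.linalg.norm(w_star))
```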
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Multiplicative updates for symmetric-cone factorizations [0.0]
We introduce and analyze the symmetric-cone multiplicative update (SCMU) algorithm for computing cone factorizations.
We show that the squared loss objective is non-increasing along the trajectories of the SCMU algorithm.
arXiv Detail & Related papers (2021-08-02T09:23:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.