Gradient descent in matrix factorization: Understanding large initialization
- URL: http://arxiv.org/abs/2305.19206v2
- Date: Fri, 31 May 2024 20:59:59 GMT
- Title: Gradient descent in matrix factorization: Understanding large initialization
- Authors: Hengchao Chen, Xin Chen, Mohamad Elmasri, Qiang Sun
- Abstract summary: The framework is grounded in signal-to-noise ratio concepts and inductive arguments.
The results uncover an implicit incremental learning phenomenon in GD and offer a deeper understanding of its performance in large initialization scenarios.
- Score: 6.378022003282206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient Descent (GD) has been proven effective in solving various matrix factorization problems. However, its optimization behavior with large initial values remains less understood. To address this gap, this paper presents a novel theoretical framework for examining the convergence trajectory of GD with a large initialization. The framework is grounded in signal-to-noise ratio concepts and inductive arguments. The results uncover an implicit incremental learning phenomenon in GD and offer a deeper understanding of its performance in large initialization scenarios.
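To make the setting concrete, the minimal sketch below (not the authors' code or analysis) runs plain gradient descent on the symmetric factorization objective 0.5 * ||XX^T - M||_F^2 from both a small and a large random initialization. The dimensions, initialization scales, step count, and step-size rule are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's code): plain gradient descent on the
# symmetric matrix factorization objective
#     f(X) = 0.5 * || X X^T - M ||_F^2,   X in R^{n x r},
# run from a small and from a large random initialization.
# Dimensions, step-size rule, and initialization scales are illustrative assumptions.

rng = np.random.default_rng(0)
n, r = 50, 3

U = rng.standard_normal((n, r))
M = U @ U.T                                   # ground-truth rank-r target matrix

def run_gd(init_scale, steps=20_000):
    X = init_scale * rng.standard_normal((n, r))
    # Conservative constant step size tied to the initial spectral norms (an assumption).
    lr = 0.1 / (np.linalg.norm(X @ X.T, 2) + np.linalg.norm(M, 2))
    for _ in range(steps):
        grad = 2.0 * (X @ X.T - M) @ X        # gradient of f at X (M symmetric)
        X -= lr * grad
    return np.linalg.norm(X @ X.T - M, "fro")

for scale in (0.01, 10.0):                    # "small" vs. "large" initialization
    print(f"init scale {scale:5.2f}: final ||XX^T - M||_F = {run_gd(scale):.3e}")
```

The initialization-dependent step size is an assumption chosen so that the large-initialization run stays stable under plain GD; it is not a prescription from the paper.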
Related papers
- Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization [7.940066909711888]
We analyze the learning dynamics of Low-Rank Adaptation (LoRA) for matrix factorization under gradient flow (GF).
Our analysis shows that the final error is affected by the misalignment between the singular spaces of the pre-trained model and the target matrix.
arXiv Detail & Related papers (2025-03-10T06:57:10Z) - On the Crucial Role of Initialization for Matrix Factorization [40.834791383134416]
This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates.
We introduce Nystrom initialization for gradient descent (NyGD) in both symmetric and asymmetric matrix factorization tasks and extend it to low-rank adapters (LoRA).
Our approach, NoRA, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
arXiv Detail & Related papers (2024-10-24T17:58:21Z) - Implicit regularization in AI meets generalized hardness of approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation (GHA).
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models.
This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
arXiv Detail & Related papers (2023-01-27T02:30:51Z) - Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization [15.127728811011245]
We show that the implicit regularization of GD plays a critical role in the analysis.
arXiv Detail & Related papers (2022-12-19T12:05:37Z) - Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of a stochastic optimization algorithm can be bounded in terms of the 'complexity' of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent [55.96478231566129]
We show that relative scales play an important role in determining the learned model.
We develop a technique for deriving the inductive bias of gradient-flow.
arXiv Detail & Related papers (2021-02-19T07:10:48Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks [39.856439772974454]
We show that the width needed for efficient convergence to a global minimum is independent of the depth.
Our results suggest an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
arXiv Detail & Related papers (2020-01-16T18:48:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.