Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit
Bias towards Low Rank
- URL: http://arxiv.org/abs/2011.13772v5
- Date: Mon, 21 Aug 2023 00:53:49 GMT
- Title: Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit
Bias towards Low Rank
- Authors: Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, Holger Rauhut
- Abstract summary: In deep learning, gradientdescent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplifiedsetting of linear networks and of an estimation problem.
- Score: 1.9350867959464846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In deep learning, it is common to use more network parameters than training
points. In such scenarioof over-parameterization, there are usually multiple
networks that achieve zero training error so that thetraining algorithm induces
an implicit bias on the computed solution. In practice, (stochastic)
gradientdescent tends to prefer solutions which generalize well, which provides
a possible explanation of thesuccess of deep learning. In this paper we analyze
the dynamics of gradient descent in the simplifiedsetting of linear networks
and of an estimation problem. Although we are not in an
overparameterizedscenario, our analysis nevertheless provides insights into the
phenomenon of implicit bias. In fact, wederive a rigorous analysis of the
dynamics of vanilla gradient descent, and characterize the dynamicalconvergence
of the spectrum. We are able to accurately locate time intervals where the
effective rankof the iterates is close to the effective rank of a low-rank
projection of the ground-truth matrix. Inpractice, those intervals can be used
as criteria for early stopping if a certain regularity is desired. Wealso
provide empirical evidence for implicit bias in more general scenarios, such as
matrix sensing andrandom initialization. This suggests that deep learning
prefers trajectories whose complexity (measuredin terms of effective rank) is
monotonically increasing, which we believe is a fundamental concept for
thetheoretical understanding of deep learning.
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Understanding Incremental Learning of Gradient Descent: A Fine-grained
Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models.
This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
arXiv Detail & Related papers (2023-01-27T02:30:51Z) - A Dynamics Theory of Implicit Regularization in Deep Low-Rank Matrix
Factorization [21.64166573203593]
Implicit regularization is an important way to interpret neural networks.
Recent theory starts to explain implicit regularization with the model of deep matrix factorization (DMF)
arXiv Detail & Related papers (2022-12-29T02:11:19Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Depth Without the Magic: Inductive Bias of Natural Gradient Descent [1.020554144865699]
In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories.
We characterize the behaviour of natural gradient flow in deep linear networks for separable classification under logistic loss and deep matrix factorization.
We demonstrate that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.
arXiv Detail & Related papers (2021-11-22T21:20:10Z) - Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z) - Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity [24.428843425522107]
We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
arXiv Detail & Related papers (2021-06-17T14:16:04Z) - Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep randomly neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth)
arXiv Detail & Related papers (2021-06-07T16:29:59Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-perfect solutions to non-optimal training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Theoretical Analysis of Self-Training with Deep Networks on Unlabeled
Data [48.4779912667317]
Self-training algorithms have been very successful for learning with unlabeled data using neural networks.
This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning.
arXiv Detail & Related papers (2020-10-07T19:43:55Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and or binary weights the training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.