The Law of Parsimony in Gradient Descent for Learning Deep Linear
Networks
- URL: http://arxiv.org/abs/2306.01154v1
- Date: Thu, 1 Jun 2023 21:24:53 GMT
- Title: The Law of Parsimony in Gradient Descent for Learning Deep Linear
Networks
- Authors: Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, and Qing Qu
- Abstract summary: We reveal a surprising "law of parsimony" in the learning dynamics when the data possesses low-dimensional structures.
This simplicity in learning dynamics could have significant implications for both efficient training and a better understanding of deep networks.
- Score: 34.85235641812005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past few years, an extensively studied phenomenon in training deep
networks is the implicit bias of gradient descent towards parsimonious
solutions. In this work, we investigate this phenomenon by narrowing our focus
to deep linear networks. Through our analysis, we reveal a surprising "law of
parsimony" in the learning dynamics when the data possesses low-dimensional
structures. Specifically, we show that the evolution of gradient descent
starting from orthogonal initialization only affects a minimal portion of
singular vector spaces across all weight matrices. In other words, the learning
process happens only within a small invariant subspace of each weight matrix,
despite the fact that all weight parameters are updated throughout training.
This simplicity in learning dynamics could have significant implications for
both efficient training and a better understanding of deep networks. First, the
analysis enables us to considerably improve training efficiency by taking
advantage of the low-dimensional structure in learning dynamics. We can
construct smaller, equivalent deep linear networks without sacrificing the
benefits associated with the wider counterparts. Second, it allows us to better
understand deep representation learning by elucidating the linear progressive
separation and concentration of representations from shallow to deep layers. We
also conduct numerical experiments to support our theoretical results. The code
for our experiments can be found at https://github.com/cjyaras/lawofparsimony.
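The law of parsimony lends itself to a quick numerical check. The sketch below (PyTorch, on a hypothetical rank-r regression task rather than the paper's exact setting) trains a three-layer deep linear network from orthogonal initialization on low-rank data and then reports the numerical rank of each weight update; if the dynamics are confined to a small invariant subspace, these ranks should sit far below the layer widths.
```python
# Minimal sketch (not the paper's exact setup): train a deep linear network
# from orthogonal initialization on data whose targets depend only on an
# r-dimensional subspace, then check how many singular directions of each
# weight matrix actually moved during training.
import torch

torch.manual_seed(0)
d, r, n, width, depth = 32, 3, 256, 64, 3

# Hypothetical low-rank regression targets: Y = M X with rank-r M.
X = torch.randn(d, n)
M = torch.randn(d, r) @ torch.randn(r, d) / d
Y = M @ X

# Deep linear network f(X) = W_L ... W_1 X with orthogonal initialization.
sizes = [d] + [width] * (depth - 1) + [d]
Ws = [torch.nn.init.orthogonal_(torch.empty(sizes[i + 1], sizes[i])).requires_grad_()
      for i in range(depth)]
W0 = [W.detach().clone() for W in Ws]  # snapshot of the initialization

opt = torch.optim.SGD(Ws, lr=0.05)
for _ in range(2000):
    out = X
    for W in Ws:
        out = W @ out
    loss = ((out - Y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With low-dimensional data, the update W - W0 should itself be numerically low rank.
for l, (W, W_init) in enumerate(zip(Ws, W0), 1):
    s = torch.linalg.svdvals(W.detach() - W_init)
    moved = int((s > 1e-3 * s[0]).sum())
    print(f"layer {l}: numerical rank of update ~ {moved} (matrix shape {tuple(W.shape)})")
```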
Related papers
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - Understanding Deep Representation Learning via Layerwise Feature
Compression and Discrimination [33.273226655730326]
We show that each layer of a deep linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate.
This is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks.
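As a rough illustration of the layerwise behavior described above, the sketch below trains a small deep linear network on a synthetic two-class mixture and reports a within-class to between-class scatter ratio at every layer. The metric and the toy data are our own choices, not necessarily the exact quantities analyzed in that paper.
```python
# Hedged sketch: measure within-class vs. between-class scatter of the features
# at each layer of a small deep linear network trained on a two-class Gaussian
# mixture. For a trained network the paper's claim predicts this ratio to shrink
# with depth.
import torch

def within_between_ratio(Z, y):
    mu = Z.mean(dim=0)
    within = between = 0.0
    for c in y.unique():
        Zc = Z[y == c]
        within = within + ((Zc - Zc.mean(0)) ** 2).sum()
        between = between + len(Zc) * ((Zc.mean(0) - mu) ** 2).sum()
    return (within / between).item()

torch.manual_seed(0)
n, d, depth = 400, 16, 4
y = torch.arange(n) % 2
X = torch.randn(n, d); X[:, 0] += 4.0 * y             # class signal in one coordinate
T = torch.stack([1.0 - y.float(), y.float()], dim=1)  # one-hot targets

layers = [torch.nn.Linear(d, d, bias=False) for _ in range(depth - 1)]
layers.append(torch.nn.Linear(d, 2, bias=False))
net = torch.nn.Sequential(*layers)
opt = torch.optim.SGD(net.parameters(), lr=0.05)
for _ in range(3000):
    loss = ((net(X) - T) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    Z = X
    for l, layer in enumerate(net, 1):
        Z = layer(Z)
        print(f"layer {l}: within/between scatter ratio = {within_between_ratio(Z, y):.4f}")
```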
arXiv Detail & Related papers (2023-11-06T09:00:38Z) - Understanding Deep Neural Networks via Linear Separability of Hidden
Layers [68.23950220548417]
We first propose Minkowski difference based linear separability measures (MD-LSMs) to evaluate the degree of linear separability of two point sets.
We demonstrate that there is a synchronicity between the linear separability degree of hidden layer outputs and the network training performance.
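A simple proxy for this idea is a linear probe: fit a plain linear classifier on each hidden layer's outputs and use its accuracy as a stand-in for the degree of linear separability. The sketch below does this with scikit-learn's LogisticRegression on a small trained ReLU network; it is a substitute metric, not the paper's MD-LSM measure, and the task is an arbitrary toy example.
```python
# Hedged sketch: linear-probe accuracy per hidden layer as a proxy for the
# degree of linear separability (not the paper's MD-LSM measure).
import torch
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
n, d = 500, 10
X = torch.randn(n, d)
y = ((X[:, 0] * X[:, 1]) > 0).long()          # a nonlinearly separable labeling

net = torch.nn.Sequential(
    torch.nn.Linear(d, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    loss = torch.nn.functional.cross_entropy(net(X), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Probe each post-activation representation with a plain linear classifier;
# deeper layers of the trained network tend to be more linearly separable here.
with torch.no_grad():
    Z = X
    for i, layer in enumerate(net):
        Z = layer(Z)
        if isinstance(layer, torch.nn.ReLU):
            probe = LogisticRegression(max_iter=2000).fit(Z.numpy(), y.numpy())
            acc = probe.score(Z.numpy(), y.numpy())
            print(f"hidden layer after module {i}: linear-probe accuracy = {acc:.3f}")
```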
arXiv Detail & Related papers (2023-07-26T05:29:29Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce a one-stage solution that obtains pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices.
This is the first time a low-rank phenomenon has been proven rigorously for nonlinear ReLU-activated feedforward networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
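For the deep linear case, the low-rank claim is easy to probe empirically. The sketch below (our own toy setup, not the paper's) runs gradient descent with logistic loss on linearly separable data and reports the ratio of the second to the first singular value of each weight matrix, which should shrink as the weights approach rank one.
```python
# Hedged sketch of the linear case: train a deep linear network with logistic
# loss on linearly separable data and watch each weight matrix become
# approximately rank one (top singular value dominates).
import torch

torch.manual_seed(0)
n, d, depth, width = 200, 10, 3, 32
w_star = torch.randn(d)
X = torch.randn(n, d)
y = torch.sign(X @ w_star)                      # linearly separable labels in {-1, +1}

sizes = [d] + [width] * (depth - 1) + [1]
Ws = [(0.1 * torch.randn(sizes[i + 1], sizes[i])).requires_grad_() for i in range(depth)]

opt = torch.optim.SGD(Ws, lr=0.1)
for _ in range(5000):
    out = X.T
    for W in Ws:
        out = W @ out
    margins = y * out.squeeze(0)
    loss = torch.nn.functional.softplus(-margins).mean()   # logistic loss
    opt.zero_grad(); loss.backward(); opt.step()

for l, W in enumerate(Ws, 1):
    s = torch.linalg.svdvals(W.detach())
    ratio = (s[1] / s[0]).item() if len(s) > 1 else 0.0
    print(f"layer {l}: sigma_2 / sigma_1 = {ratio:.3e}")
```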
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
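For context, the linearization behind NTK theory is governed by the empirical tangent kernel, whose entries are inner products of parameter gradients of the network output. The sketch below computes one such entry for a tiny network; the architecture is arbitrary and chosen only for illustration.
```python
# Hedged sketch: an empirical NTK entry K(x1, x2) is the inner product of the
# parameter gradients of the network output evaluated at x1 and x2.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(5, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def param_grad(x):
    # Gradient of the scalar output with respect to all parameters, flattened.
    net.zero_grad()
    net(x).sum().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x1, x2 = torch.randn(5), torch.randn(5)
g1, g2 = param_grad(x1), param_grad(x2)
print("empirical NTK entry K(x1, x2) =", (g1 @ g2).item())
```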
arXiv Detail & Related papers (2021-06-12T13:05:11Z) - Towards Understanding Learning in Neural Networks with Linear Teachers [31.849269592822296]
We prove that SGD globally optimizes this learning problem for a two-layer network with Leaky ReLU activations.
We provide theoretical support for this phenomenon by proving that if network weights converge to two weight clusters, this will imply an approximately linear decision boundary.
arXiv Detail & Related papers (2021-01-07T13:21:24Z) - Statistical Mechanics of Deep Linear Neural Networks: The
Back-Propagating Renormalization Group [4.56877715768796]
We study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear.
We solve exactly for the network properties after supervised learning, using an equilibrium Gibbs distribution in weight space.
Our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks with modest depth.
arXiv Detail & Related papers (2020-12-07T20:08:31Z) - Align, then memorise: the dynamics of learning with feedback alignment [12.587037358391418]
Direct Feedback Alignment (DFA) is an efficient alternative to the ubiquitous backpropagation algorithm for training deep neural networks.
DFA successfully trains state-of-the-art models such as Transformers, but it notoriously fails to train convolutional networks.
Here, we propose a theory for the success of DFA.
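For readers unfamiliar with DFA, the sketch below spells out its update rule on a toy regression problem: the output error is projected to each hidden layer through fixed random feedback matrices instead of the transposed forward weights that backpropagation would use. The architecture and hyperparameters are arbitrary.
```python
# Hedged sketch of the Direct Feedback Alignment update for a two-hidden-layer
# tanh network: fixed random matrices B1, B2 carry the output error straight
# to each hidden layer.
import torch

torch.manual_seed(0)
n, d_in, h, d_out, lr = 128, 20, 64, 1, 0.05
X = torch.randn(n, d_in)
y = torch.randn(n, d_out)

W1 = torch.randn(d_in, h) * 0.1
W2 = torch.randn(h, h) * 0.1
W3 = torch.randn(h, d_out) * 0.1
B1, B2 = torch.randn(d_out, h), torch.randn(d_out, h)   # fixed random feedback matrices

def act(z):  # tanh nonlinearity and its derivative
    return torch.tanh(z), 1 - torch.tanh(z) ** 2

for step in range(200):
    # Forward pass
    z1 = X @ W1; a1, da1 = act(z1)
    z2 = a1 @ W2; a2, da2 = act(z2)
    out = a2 @ W3
    e = out - y                                  # output error (MSE gradient direction)

    # DFA: project the output error directly to each hidden layer via B1, B2
    d2 = (e @ B2) * da2
    d1 = (e @ B1) * da1

    # Local weight updates
    W3 -= lr * a2.T @ e / n
    W2 -= lr * a1.T @ d2 / n
    W1 -= lr * X.T @ d1 / n

pred = torch.tanh(torch.tanh(X @ W1) @ W2) @ W3
print("final MSE:", ((pred - y) ** 2).mean().item())
```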
arXiv Detail & Related papers (2020-11-24T22:21:27Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local
Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.