Convergence Analysis of Over-parameterized Deep Linear Networks, and the
Principal Components Bias
- URL: http://arxiv.org/abs/2105.05553v1
- Date: Wed, 12 May 2021 10:08:42 GMT
- Title: Convergence Analysis of Over-parameterized Deep Linear Networks, and the
Principal Components Bias
- Authors: Guy Hacohen and Daphna Weinshall
- Abstract summary: Convolutional Neural networks learn to classify images in the same order.
We show how the PC-bias streamlines the order of learning of both linear and non-linear networks.
- Score: 14.46779433267854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural networks of different architectures seem to learn to
classify images in the same order. To understand this phenomenon, we revisit
the over-parametrized deep linear network model. Our analysis of this model's
learning dynamics reveals that the convergence rate of its parameters is
exponentially faster along directions corresponding to the larger principal
components of the data, at a rate governed by the singular values. We term this
convergence pattern the Principal Components bias (PC-bias). We show how the
PC-bias streamlines the order of learning of both linear and non-linear
networks, more prominently in earlier stages of learning. We then compare our
results to the spectral bias, showing that both biases can be seen
independently, and affect the order of learning in different ways. Finally, we
discuss how the PC-bias can explain several phenomena, including the benefits
of prevalent initialization schemes, how early stopping may be related to PCA,
and why deep networks converge more slowly when given random labels.
Related papers
- On Privileged and Convergent Bases in Neural Network Representations [7.888192939262696]
We show that even in wide networks such as WideResNets, neural networks do not converge to a unique basis.
We also analyze Linear Mode Connectivity, which has been studied as a measure of basis correlation.
arXiv Detail & Related papers (2023-07-24T17:11:39Z) - Predictions Based on Pixel Data: Insights from PDEs and Finite Differences [0.0]
This paper deals with approximation of time sequences where each observation is a matrix.
We show that with relatively small networks, we can represent exactly a class of numerical discretizations of PDEs based on the method of lines.
Our network architecture is inspired by those typically adopted in the approximation of time sequences.
arXiv Detail & Related papers (2023-05-01T08:54:45Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
We then exploit higher-order statistics only later during training.
We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Deep Architecture Connectivity Matters for Its Convergence: A
Fine-Grained Analysis [94.64007376939735]
We theoretically characterize the impact of connectivity patterns on the convergence of deep neural networks (DNNs) under gradient descent training.
We show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate.
arXiv Detail & Related papers (2022-05-11T17:43:54Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices.
This is the first time a low-rank phenomenon is proven rigorously for nonlinear ReLU-activated feedforward networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - The Principles of Deep Learning Theory [19.33681537640272]
This book develops an effective theory approach to understanding deep neural networks of practical relevance.
We explain how these effectively-deep networks learn nontrivial representations from training.
We show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks.
arXiv Detail & Related papers (2021-06-18T15:00:00Z) - Learning Connectivity of Neural Networks from a Topological Perspective [80.35103711638548]
We propose a topological perspective to represent a network into a complete graph for analysis.
By assigning learnable parameters to the edges which reflect the magnitude of connections, the learning process can be performed in a differentiable manner.
This learning process is compatible with existing networks and owns adaptability to larger search spaces and different tasks.
arXiv Detail & Related papers (2020-08-19T04:53:31Z) - Learning from Failure: Training Debiased Classifier from Biased
Classifier [76.52804102765931]
We show that neural networks learn to rely on spurious correlation only when it is "easier" to learn than the desired knowledge.
We propose a failure-based debiasing scheme by training a pair of neural networks simultaneously.
Our method significantly improves the training of the network against various types of biases in both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-07-06T07:20:29Z) - The Heterogeneity Hypothesis: Finding Layer-Wise Differentiated Network
Architectures [179.66117325866585]
We investigate a design space that is usually overlooked, i.e. adjusting the channel configurations of predefined networks.
We find that this adjustment can be achieved by shrinking widened baseline networks and leads to superior performance.
Experiments are conducted on various networks and datasets for image classification, visual tracking and image restoration.
arXiv Detail & Related papers (2020-06-29T17:59:26Z) - The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers with convolutional architecture.
arXiv Detail & Related papers (2020-06-25T17:42:49Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.