Deep orthogonal linear networks are shallow
- URL: http://arxiv.org/abs/2011.13831v1
- Date: Fri, 27 Nov 2020 16:57:19 GMT
- Title: Deep orthogonal linear networks are shallow
- Authors: Pierre Ablin
- Abstract summary: We show that training the weights with Riemannian gradient descent is equivalent to training the whole factorization by gradient descent.
This means that there is no effect of overparametrization and implicit bias at all in this setting.
- Score: 9.434391240650266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of training a deep orthogonal linear network, which
consists of a product of orthogonal matrices, with no non-linearity in-between.
We show that training the weights with Riemannian gradient descent is
equivalent to training the whole factorization by gradient descent. This means
that there is no effect of overparametrization and implicit bias at all in this
setting: training such a deep, overparametrized, network is perfectly
equivalent to training a one-layer shallow network.
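As a concrete, unofficial illustration of this equivalence, the sketch below takes a product of random orthogonal matrices, applies one Riemannian gradient step (skew-symmetric Riemannian gradient with an exponential-map retraction, a standard choice on the orthogonal group) to every factor, and compares the resulting product against a single Riemannian step of size p * eta taken directly on the product. The toy least-squares loss, the helper names (`rand_orth`, `product`), and the step size are illustrative assumptions, not taken from the paper; the check only verifies the equivalence to first order in the step size, whereas the paper proves an exact statement.
```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eta = 5, 4, 1e-4   # matrix size, number of orthogonal factors, small step size

def rand_orth(n):
    # random orthogonal matrix from the QR decomposition of a Gaussian matrix
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def skew(a):
    return 0.5 * (a - a.T)

def expm(a, terms=30):
    # truncated exponential series; accurate enough for the tiny arguments used here
    out, term = np.eye(a.shape[0]), np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k
        out = out + term
    return out

def product(mats):
    out = np.eye(n)
    for M in mats:
        out = out @ M
    return out

T = rand_orth(n)                       # target of the toy loss 0.5 * ||W - T||_F^2
euclidean_grad = lambda W: W - T       # its Euclidean gradient

Ws = [rand_orth(n) for _ in range(p)]  # deep network: W = Ws[0] @ Ws[1] @ ... @ Ws[p-1]
W0 = product(Ws)
G = euclidean_grad(W0)

# One Riemannian gradient step on every factor:
#   W_i <- W_i expm(-eta * skew(W_i^T G_i)),  with  G_i = A^T G B^T,
# where A (resp. B) is the product of the factors to the left (resp. right) of W_i.
new_Ws = []
for i, Wi in enumerate(Ws):
    A = product(Ws[:i])
    B = product(Ws[i + 1:])
    Gi = A.T @ G @ B.T
    new_Ws.append(Wi @ expm(-eta * skew(Wi.T @ Gi)))
deep_product = product(new_Ws)

# One Riemannian gradient step on the product itself, with step size p * eta
# (each of the p factors contributes one copy of the shallow Riemannian gradient).
shallow_product = W0 @ expm(-p * eta * skew(W0.T @ G))

print("||deep step - shallow step|| :", np.linalg.norm(deep_product - shallow_product))
print("eta                          :", eta)  # the discrepancy should be of order eta**2
```
With `eta = 1e-4`, the printed gap should be several orders of magnitude smaller than the step itself, which is the first-order numerical signature of the deep and shallow training dynamics coinciding.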
Related papers
- Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment [0.0]
We use an exponential solver to train a neural network without entering the edge of stability.
We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned.
arXiv Detail & Related papers (2024-05-31T18:37:06Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices.
This is the first time a low-rank phenomenon is proven rigorously for nonlinear ReLU-activated feedforward networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - A Unifying View on Implicit Bias in Training Linear Neural Networks [31.65006970108761]
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training.
We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases.
arXiv Detail & Related papers (2020-10-06T06:08:35Z) - On the linearity of large non-linear models: when and why the tangent
kernel is constant [20.44438519046223]
We shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity.
We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network (a small numerical illustration of this near-constancy is sketched after this list).
arXiv Detail & Related papers (2020-10-02T16:44:45Z) - From deep to Shallow: Equivalent Forms of Deep Networks in Reproducing
Kernel Krein Space and Indefinite Support Vector Machines [63.011641517977644]
We take a deep network and convert it to an equivalent (indefinite) kernel machine.
We then investigate the implications of this transformation for capacity control and uniform convergence.
Finally, we analyse the sparsity properties of the flat representation, showing that the flat weights are (effectively) Lp-"norm" regularised with 0 < p < 1.
arXiv Detail & Related papers (2020-07-15T03:21:35Z) - Neural Subdivision [58.97214948753937]
This paper introduces Neural Subdivision, a novel framework for data-driven coarse-to-fine geometry modeling.
We optimize for the same set of network weights across all local mesh patches, thus providing an architecture that is not constrained to a specific input mesh, fixed genus, or category.
We demonstrate that even when trained on a single high-resolution mesh our method generates reasonable subdivisions for novel shapes.
arXiv Detail & Related papers (2020-05-04T20:03:21Z) - Eigendecomposition-Free Training of Deep Networks for Linear
Least-Square Problems [107.3868459697569]
We introduce an eigendecomposition-free approach to training a deep network.
We show that our approach is much more robust than explicit differentiation of the eigendecomposition.
Our method has better convergence properties and yields state-of-the-art results.
arXiv Detail & Related papers (2020-04-15T04:29:34Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close-to-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.