Global Convergence Analysis of Deep Linear Networks with A One-neuron
Layer
- URL: http://arxiv.org/abs/2201.02761v1
- Date: Sat, 8 Jan 2022 04:44:59 GMT
- Title: Global Convergence Analysis of Deep Linear Networks with A One-neuron
Layer
- Authors: Kun Chen, Dachao Lin, Zhihua Zhang
- Abstract summary: We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss.
We describe the convergent point of trajectories with arbitrary starting point under gradient flow.
We show specific convergence rates of trajectories that converge to the global minimizer by stages.
- Score: 18.06634056613645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we follow Eftekhari's work to give a non-local convergence
analysis of deep linear networks. Specifically, we consider optimizing deep
linear networks which have a layer with one neuron under quadratic loss. We
describe the convergent point of trajectories with arbitrary starting point
under gradient flow, including the paths which converge to one of the saddle
points or the original point. We also show specific convergence rates of
trajectories that converge to the global minimizer by stages. To achieve these
results, this paper mainly extends the machinery in Eftekhari's work to
provably identify the rank-stable set and the global minimizer convergent set.
We also give specific examples to show the necessity of our definitions.
Crucially, as far as we know, our results appear to be the first to give a
non-local global analysis of linear neural networks from arbitrarily initialized
points, rather than relying on the lazy training regime that has dominated the
neural network literature, or on the restricted benign initialization in Eftekhari's work.
We also note that extending our results to general linear networks without one
hidden neuron assumption remains a challenging open problem.
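To make the setting concrete, below is a minimal numerical sketch (not the authors' code) of the problem the abstract describes: a deep linear network in which one hidden layer has a single neuron, trained on a quadratic loss, with gradient flow approximated by plain gradient descent using a small step size. The layer sizes, random data, and initialization scale are illustrative assumptions.
```python
import numpy as np

# Sketch of a deep linear network W3 @ W2 @ W1 with a one-neuron middle layer,
# optimized on quadratic loss. Gradient flow is approximated by gradient
# descent with a small step size. All sizes/data are illustrative assumptions.
rng = np.random.default_rng(0)

d_in, d_hidden, d_out, n = 5, 4, 3, 50
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, n))

W1 = rng.normal(size=(1, d_in)) * 0.1        # one-neuron layer
W2 = rng.normal(size=(d_hidden, 1)) * 0.1
W3 = rng.normal(size=(d_out, d_hidden)) * 0.1

eta = 1e-3                                    # small step to mimic gradient flow
for step in range(20001):
    P = W3 @ W2 @ W1                          # end-to-end linear map
    R = P @ X - Y                             # residual
    loss = 0.5 * np.sum(R ** 2)               # quadratic loss

    G = R @ X.T                               # dL/dP
    g1 = (W3 @ W2).T @ G                      # dL/dW1
    g2 = W3.T @ G @ W1.T                      # dL/dW2
    g3 = G @ (W2 @ W1).T                      # dL/dW3

    W1 -= eta * g1
    W2 -= eta * g2
    W3 -= eta * g3

    if step % 5000 == 0:
        print(step, loss)
```
Tracking the loss along this trajectory is one way to see the stage-wise behavior the abstract refers to; with the rank-one bottleneck, where the trajectory ends up (a saddle point or the global minimizer of the rank-constrained problem) depends on the initialization, which is exactly what the paper's analysis characterizes.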
Related papers
- Learning a Neuron by a Shallow ReLU Network: Dynamics and Implicit Bias
for Correlated Inputs [5.7166378791349315]
We prove that, for the fundamental regression task of learning a single neuron, training a one-hidden layer ReLU network converges to zero loss.
We also show and characterise a surprising distinction in this setting between interpolator networks of minimal rank and those of minimal Euclidean norm.
arXiv Detail & Related papers (2023-06-10T16:36:22Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - The loss landscape of deep linear neural networks: a second-order analysis [9.85879905918703]
We study the optimization landscape of deep linear neural networks with the square loss.
We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points.
arXiv Detail & Related papers (2021-07-28T11:33:18Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - Directional Convergence Analysis under Spherically Symmetric
Distribution [21.145823611499104]
We consider the fundamental problem of learning linear predictors (i.e., separable datasets with zero margin) using neural networks with gradient flow or gradient descent.
We show directional convergence guarantees with exact convergence rate for two-layer non-linear networks with only two hidden nodes, and (deep) linear networks.
arXiv Detail & Related papers (2021-05-09T08:59:58Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - Generalization bound of globally optimal non-convex neural network
training: Transportation map estimation by infinite dimensional Langevin
dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking limit of infinite width of the network to show its global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z)