On the Global Convergence of Training Deep Linear ResNets
- URL: http://arxiv.org/abs/2003.01094v1
- Date: Mon, 2 Mar 2020 18:34:49 GMT
- Title: On the Global Convergence of Training Deep Linear ResNets
- Authors: Difan Zou and Philip M. Long and Quanquan Gu
- Abstract summary: We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets).
We prove that for training deep residual networks with certain linear transformations at input and output layers, both GD and SGD can converge to the global minimum of the training loss.
- Score: 104.76256863926629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the convergence of gradient descent (GD) and stochastic gradient
descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets).
We prove that for training deep residual networks with certain linear
transformations at input and output layers, which are fixed throughout
training, both GD and SGD with zero initialization on all hidden weights can
converge to the global minimum of the training loss. Moreover, when
specializing to appropriate Gaussian random linear transformations, GD and SGD
provably optimize wide enough deep linear ResNets. Compared with the global
convergence result of GD for training standard deep linear networks (Du & Hu
2019), our condition on the neural network width is sharper by a factor of
$O(\kappa L)$, where $\kappa$ denotes the condition number of the covariance
matrix of the training data. We further propose modified identity input and
output transformations, and show that a $(d+k)$-wide neural network is
sufficient to guarantee the global convergence of GD/SGD, where $d,k$ are the
input and output dimensions respectively.
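As a concrete illustration of the setup in the abstract, below is a minimal numpy sketch (not the authors' code) of gradient descent on a deep linear ResNet $f(x) = B(I + W_L)\cdots(I + W_1)Ax$ with fixed Gaussian input and output transformations $A$ and $B$, zero initialization on all hidden weights, and the squared training loss. All dimensions, the step size, and the iteration count are arbitrary demo choices rather than values from the paper.

```python
# Minimal sketch: GD on a deep linear ResNet with fixed Gaussian input/output maps
# and zero-initialized hidden weights (illustrative only; sizes are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
d, k, m, L, n, lr = 5, 3, 64, 8, 100, 0.02    # input dim, output dim, width, depth, samples, step size

X = rng.standard_normal((d, n))               # training inputs (one column per sample)
Y = rng.standard_normal((k, d)) @ X           # linear targets, so zero training loss is attainable
A = rng.standard_normal((m, d)) / np.sqrt(m)  # fixed Gaussian input transformation
B = rng.standard_normal((k, m)) / np.sqrt(m)  # fixed Gaussian output transformation
W = [np.zeros((m, m)) for _ in range(L)]      # zero initialization on every hidden weight

def train_loss():
    H = A @ X
    for Wl in W:
        H = H + Wl @ H                        # residual block: H <- (I + W_l) H
    return 0.5 / n * np.linalg.norm(B @ H - Y) ** 2

for step in range(1000):
    Hs = [A @ X]                              # forward pass, caching each block's input
    for Wl in W:
        Hs.append(Hs[-1] + Wl @ Hs[-1])
    resid = (B @ Hs[-1] - Y) / n              # gradient of the loss w.r.t. the network output
    G = B.T @ resid                           # gradient w.r.t. the last hidden representation
    grads = [None] * L
    for l in reversed(range(L)):
        grads[l] = G @ Hs[l].T                # dLoss/dW_l
        G = G + W[l].T @ G                    # back through the block: (I + W_l)^T G
    for l in range(L):
        W[l] -= lr * grads[l]                 # plain GD on hidden weights; A and B stay fixed

print("training loss after GD:", train_loss())
```

Because the targets here are realizable by a linear map, the printed loss should come out close to zero, loosely illustrating (not proving) the global convergence behavior the abstract establishes for sufficiently wide networks.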
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z)
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
We consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variant in this paper.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z)
- Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss as the number of training epochs increases.
We show that the threshold on the number of training samples increases with the network width (a minimal ISTA soft-thresholding sketch appears after this list).
arXiv Detail & Related papers (2023-09-12T13:03:47Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- Generalization Error Bounds for Deep Neural Networks Trained by SGD [3.148524502470734]
Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived.
The bounds explicitly depend on the loss along the training trajectory.
Results show that our bounds are non-vacuous and robust to changes in the network architecture and hyperparameters.
arXiv Detail & Related papers (2022-06-07T13:46:10Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training neural networks for their fast convergence.
We show that Nesterov's accelerated gradient method converges to the global minimum at a rate governed by the condition number $\kappa$.
We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
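The unfolded ISTA entry above refers to networks built by unrolling ISTA iterations, each of which applies a (smooth) soft-thresholding nonlinearity. For reference, here is a minimal numpy sketch of classical ISTA with the standard, non-smooth soft-thresholding operator; the problem sizes, step size, and regularization weight are arbitrary demo choices, and the cited paper's smooth soft-thresholding and training procedure are not reproduced here.

```python
# Minimal sketch: classical ISTA for sparse recovery, min_x 0.5*||Ax - y||^2 + lam*||x||_1.
# An unfolded ("learned") ISTA network stacks a few such iterations as layers with
# trainable step sizes / thresholds; here everything is fixed and hand-chosen.
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista_step(x, A, y, step, lam):
    """One ISTA iteration: gradient step on the quadratic term, then soft-thresholding."""
    return soft_threshold(x - step * A.T @ (A @ x - y), step * lam)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50)) / np.sqrt(20)   # measurement matrix (arbitrary demo sizes)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = 1.0                         # sparse ground truth
y = A @ x_true

x = np.zeros(50)
step = 1.0 / np.linalg.norm(A, 2) ** 2            # step size at most 1/||A||_2^2 keeps ISTA stable
for _ in range(200):                              # unrolling T iterations ~ a T-layer unfolded network
    x = ista_step(x, A, y, step, lam=0.01)

print("recovery error:", np.linalg.norm(x - x_true))
```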