Global convergence of ResNets: From finite to infinite width using
linear parameterization
- URL: http://arxiv.org/abs/2112.05531v2
- Date: Mon, 6 Feb 2023 13:42:45 GMT
- Title: Global convergence of ResNets: From finite to infinite width using
linear parameterization
- Authors: Raphaël Barboni (ENS-PSL), Gabriel Peyré (ENS-PSL, CNRS), François-Xavier Vialard (LIGM)
- Abstract summary: We study Residual Networks (ResNets) in which the residual block has linear parametrization while still being nonlinear.
In this limit, we prove a local Polyak-Lojasiewicz inequality, retrieving the lazy regime.
Our analysis leads to a practical and quantified recipe.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparametrization is a key factor explaining, in the absence of convexity, the
global convergence of gradient descent (GD) for neural networks. Besides the
well-studied lazy regime, infinite-width (mean-field) analyses have been
developed for shallow networks, using convex optimization techniques. To
bridge the gap between the lazy and mean-field regimes, we study Residual
Networks (ResNets) in which the residual block has linear parametrization while
still being nonlinear. Such ResNets admit both infinite depth and width limits,
encoding residual blocks in a Reproducing Kernel Hilbert Space (RKHS). In this
limit, we prove a local Polyak-Lojasiewicz inequality. Thus, every critical
point is a global minimizer and a local convergence result of GD holds,
retrieving the lazy regime. In contrast with other mean-field studies, it
applies to both parametric and non-parametric cases under an expressivity
condition on the residuals. Our analysis leads to a practical and quantified
recipe: starting from a universal RKHS, Random Fourier Features are applied to
obtain a finite-dimensional parameterization that satisfies our expressivity
condition with high probability.
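For context, the local Polyak-Lojasiewicz (PL) inequality referred to above can be stated in generic notation; the constant $\mu$, the loss $L$, and the neighborhood below are standard placeholders, not the paper's precise quantities:

```latex
% Local PL inequality around an initialization \theta_0 (generic notation).
\[
  \tfrac{1}{2}\,\|\nabla L(\theta)\|^{2} \;\ge\; \mu\,\bigl(L(\theta) - L^{\star}\bigr)
  \qquad \text{for all } \theta \in B(\theta_{0}, r).
\]
% On such a neighborhood every critical point is a global minimizer, and
% gradient flow started at \theta_0 converges linearly:
\[
  L(\theta_{t}) - L^{\star} \;\le\; e^{-2\mu t}\,\bigl(L(\theta_{0}) - L^{\star}\bigr).
\]
```

The recipe in the last sentence can be sketched as follows. This is a minimal illustration assuming a Gaussian RKHS and residual blocks that are linear in their trainable weights; the function names, bandwidth, and sizes are our own choices, not the paper's:

```python
import numpy as np

def random_fourier_features(dim, n_features, bandwidth=1.0, seed=None):
    """Random Fourier Features: phi(x) . phi(y) approximates the Gaussian
    kernel exp(-||x - y||^2 / (2 * bandwidth**2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / bandwidth, size=(n_features, dim))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)            # random phases

    def phi(x):
        # x: (..., dim) -> (..., n_features)
        return np.sqrt(2.0 / n_features) * np.cos(x @ W.T + b)

    return phi

# Residual blocks x -> x + phi(x) @ V.T: nonlinear in x (through the fixed
# random features) but linear in the trainable weights V of each block.
dim, n_features, n_blocks = 4, 256, 8
phi = random_fourier_features(dim, n_features, bandwidth=2.0, seed=0)
V = [np.zeros((dim, n_features)) for _ in range(n_blocks)]  # small initialization

def resnet(x):
    for Vk in V:
        x = x + phi(x) @ Vk.T
    return x

x = np.random.default_rng(1).normal(size=(10, dim))
print(resnet(x).shape)  # (10, 4)
```

The point of this parameterization is that each block is linear in its weights, so the residual blocks live (approximately) in an RKHS while the network remains a nonlinear map of its input.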
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNets in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z)
- Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, with constant depth and increasing width.
The continuous kernel error norm implies an approximation under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- On generalization bounds for deep networks based on loss surface implicit regularization [5.68558935178946]
Modern deep neural networks generalize well despite a large number of parameters, which contradicts classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z)
- A global convergence theory for deep ReLU implicit networks via over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z)
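For readers unfamiliar with implicit networks: instead of a feed-forward stack, the hidden state is defined by a fixed-point equation. Below is a minimal sketch of that idea in the ReLU case; the specific equation, the spectral-norm rescaling, and the plain fixed-point iteration are illustrative assumptions, not this paper's exact construction or training analysis:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def implicit_layer(x, W, U, b, n_iter=100):
    """Solve the fixed-point equation z = relu(W z + U x + b) by iteration.
    The iteration converges when, e.g., the spectral norm of W is below 1,
    a common well-posedness assumption for implicit models."""
    z = np.zeros(W.shape[0])
    for _ in range(n_iter):
        z = relu(W @ z + U @ x + b)
    return z

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 16
W = rng.normal(size=(d_hidden, d_hidden))
W *= 0.9 / np.linalg.norm(W, 2)          # rescale so ||W||_2 < 1
U = rng.normal(size=(d_hidden, d_in))
b = rng.normal(size=d_hidden)

z_star = implicit_layer(rng.normal(size=d_in), W, U, b)
print(z_star.shape)  # (16,)
```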
- On the Global Convergence of Gradient Descent for multi-layer ResNets in the mean-field regime [19.45069138853531]
First-order methods find the global optimum in the mean-field regime.
We show that if the ResNet is sufficiently large, with depth and width depending on the accuracy and confidence levels, first-order methods can find global minimizers that fit the data.
arXiv Detail & Related papers (2021-10-06T17:16:09Z)
- Overparameterization of deep ResNet: zero loss and mean-field analysis [19.45069138853531]
Finding parameters in a deep neural network (NN) that fit data is a nonconvex optimization problem.
We show that a basic first-order optimization method (gradient descent) finds a global solution with perfect fit in many practical situations.
We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
arXiv Detail & Related papers (2021-05-30T02:46:09Z)
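The "perfect fit" claim above is the familiar overparameterized interpolation phenomenon. A toy, hedged illustration with a model that is linear in its trained weights (a random-features stand-in, not the deep ResNet studied in the paper); the sizes and learning rate are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 10, 5, 500                  # many more features than data points
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

W = rng.normal(size=(width, d)) / np.sqrt(d)
Phi = np.tanh(X @ W.T)                    # fixed random features, shape (n, width)
a = np.zeros(width)                       # only the output weights are trained

lr = 2e-2
for step in range(5000):
    resid = Phi @ a - y
    loss = 0.5 * np.mean(resid ** 2)
    a -= lr * Phi.T @ resid / n           # plain gradient descent on the squared loss
print(f"final loss: {loss:.3e}")          # typically far below the initial 0.5*mean(y**2)
```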
- Convex Geometry and Duality of Over-parameterized Neural Networks [70.15611146583068]
We develop a convex analytic approach to analyze finite width two-layer ReLU networks.
We show that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set.
In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints.
arXiv Detail & Related papers (2020-02-25T23:05:33Z)
- On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
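For orientation, the Neural Tangent Kernel discussed in the entry above is the Gram matrix of parameter gradients, $K(x, x') = \langle \nabla_\theta f(x;\theta), \nabla_\theta f(x';\theta) \rangle$. A rough, self-contained way to compute an empirical NTK for a tiny ResNet-style network is sketched below; the architecture and the finite-difference gradients are our own illustrative choices, not the paper's setup:

```python
import numpy as np

def forward(params, x):
    """Tiny scalar-output ResNet-style network: h <- h + tanh(W_k h), output v . h."""
    Ws, v = params[:-1], params[-1]
    h = x
    for W in Ws:
        h = h + np.tanh(W @ h)
    return float(v @ h)

def flatten(params):
    return np.concatenate([p.ravel() for p in params])

def unflatten(vec, like):
    out, i = [], 0
    for p in like:
        out.append(vec[i:i + p.size].reshape(p.shape))
        i += p.size
    return out

def empirical_ntk(params, xs, eps=1e-5):
    """K[i, j] = grad_theta f(x_i) . grad_theta f(x_j), via central differences."""
    theta = flatten(params)
    def grad(x):
        g = np.zeros_like(theta)
        for k in range(theta.size):
            e = np.zeros_like(theta); e[k] = eps
            g[k] = (forward(unflatten(theta + e, params), x)
                    - forward(unflatten(theta - e, params), x)) / (2 * eps)
        return g
    G = np.stack([grad(x) for x in xs])
    return G @ G.T

rng = np.random.default_rng(0)
d, depth = 3, 4
params = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)] + [rng.normal(size=d)]
xs = rng.normal(size=(5, d))
print(empirical_ntk(params, xs).shape)  # (5, 5)
```

Results like the one above study when this random, finite-size kernel converges to its deterministic limit.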
This list is automatically generated from the titles and abstracts of the papers on this site.