Overparameterization of deep ResNet: zero loss and mean-field analysis
- URL: http://arxiv.org/abs/2105.14417v1
- Date: Sun, 30 May 2021 02:46:09 GMT
- Title: Overparameterization of deep ResNet: zero loss and mean-field analysis
- Authors: Zhiyan Ding and Shi Chen and Qin Li and Stephen Wright
- Abstract summary: Finding parameters in a deep neural network (NN) that fit data is a nonconvex optimization problem.
We show that a basic first-order optimization method (gradient descent) finds a global solution with perfect fit in many practical situations.
We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
- Score: 19.45069138853531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finding parameters in a deep neural network (NN) that fit training data is a
nonconvex optimization problem, but a basic first-order optimization method
(gradient descent) finds a global solution with perfect fit in many practical
situations. We examine this phenomenon for the case of Residual Neural Networks
(ResNet) with smooth activation functions in a limiting regime in which both
the number of layers (depth) and the number of neurons in each layer (width) go
to infinity. First, we use a mean-field-limit argument to prove that the
gradient descent for parameter training becomes a partial differential equation
(PDE) that characterizes gradient flow for a probability distribution in the
large-NN limit. Next, we show that the solution to the PDE converges in the
training time to a zero-loss solution. Together, these results imply that
training of the ResNet also gives a near-zero loss if the ResNet is large
enough. We give estimates of the depth and width needed to reduce the loss
below a given threshold, with high probability.
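As a schematic of the limiting objects described above (the notation $Z$, $f$, $\rho$, $E$ is chosen here for illustration and is not taken verbatim from the paper): in the joint depth-and-width limit, the ResNet's hidden state $Z(t)$ at depth $t \in [0,1]$ is driven by a distribution $\rho_t$ over layer parameters,

$$ \frac{\mathrm{d}Z(t)}{\mathrm{d}t} = \int f\bigl(Z(t),\theta\bigr)\,\rho_t(\mathrm{d}\theta), \qquad Z(0) = x, $$

the training loss becomes a functional of the parameter distribution,

$$ E(\rho) = \mathbb{E}_{(x,y)}\,\ell\bigl(Z(1;x,\rho),\,y\bigr), $$

and gradient descent on the parameters turns, in the large-NN limit, into a gradient-flow PDE for $\rho$ in the training time $s$,

$$ \partial_s \rho_s(t,\theta) = \nabla_\theta \cdot \Bigl(\rho_s(t,\theta)\,\nabla_\theta \frac{\delta E}{\delta \rho_s}(t,\theta)\Bigr). $$

In this picture, the zero-loss result corresponds to $E(\rho_s) \to 0$ as $s \to \infty$, and the quantitative estimates specify how large depth and width must be for a finite ResNet to track this limit with high probability.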
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport [26.47265060394168]
We show that the gradient flow for deep neural networks converges when the network is initialized sufficiently close to a minimizer of the loss.
This is done by relying on the theory of gradient flows in metric spaces, together with control of networks of finite width.
arXiv Detail & Related papers (2024-03-19T16:34:31Z) - The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Do Residual Neural Networks discretize Neural Ordinary Differential
Equations? [8.252615417740879]
We first quantify the distance between the ResNet's hidden state trajectory and the solution of its corresponding Neural ODE (a toy numerical illustration of this correspondence is sketched after this list).
We show that this smoothness of the residual functions with respect to depth is preserved by gradient descent for a ResNet with linear residual functions and small enough initial loss.
arXiv Detail & Related papers (2022-05-29T09:29:34Z) - Global convergence of ResNets: From finite to infinite width using
linear parameterization [0.0]
We study Residual Networks (ResNets) in which the residual block has linear parametrization while still being nonlinear.
In the infinite-width limit, we prove a local Polyak-Łojasiewicz inequality, retrieving the lazy regime.
Our analysis leads to a practical and quantified recipe.
arXiv Detail & Related papers (2021-12-10T13:38:08Z) - A global convergence theory for deep ReLU implicit networks via
over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z) - On the Global Convergence of Gradient Descent for multi-layer ResNets in
the mean-field regime [19.45069138853531]
First-order methods find the global optimum in the mean-field regime.
We show that if the ResNet is sufficiently large, with depth and width depending on the accuracy and confidence levels, first-order methods can find solutions that fit the data.
arXiv Detail & Related papers (2021-10-06T17:16:09Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
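Referenced from the "Do Residual Neural Networks discretize Neural Ordinary Differential Equations?" entry above: a minimal sketch of the ResNet/neural-ODE correspondence that several of these papers build on. The code is illustrative only (the architecture, scaling, and random weights are hypothetical and not taken from any of the listed papers); a deep ResNet with $1/\text{depth}$ scaling performs one forward-Euler step of an ODE per layer, so its hidden-state trajectory stays close to the ODE solution.

```python
# Minimal sketch (illustrative; weights and scaling are hypothetical, not taken
# from any of the listed papers) of the correspondence between a deep ResNet's
# hidden-state updates and a forward-Euler discretization of a neural ODE
#   dz/dt = f(z, theta(t)),  t in [0, 1].
import numpy as np

rng = np.random.default_rng(0)

d, width, depth = 4, 16, 64                     # state dim, block width, number of layers
W1 = rng.normal(size=(depth, width, d)) / np.sqrt(d)
W2 = rng.normal(size=(depth, d, width)) / np.sqrt(width)

def residual_block(z, k):
    """Residual function f(z, theta_k) with a smooth (tanh) activation."""
    return W2[k] @ np.tanh(W1[k] @ z)

def resnet_forward(x):
    """ResNet with 1/depth scaling: one Euler step per layer,
    z_{k+1} = z_k + (1/depth) * f(z_k, theta_k)."""
    z = x.copy()
    for k in range(depth):
        z = z + residual_block(z, k) / depth
    return z

def ode_forward(x, substeps=20):
    """Finer forward-Euler solve of dz/dt = f(z, theta(t)) with
    piecewise-constant parameters theta(t) = theta_k on [k/depth, (k+1)/depth)."""
    z = x.copy()
    h = 1.0 / (depth * substeps)
    for k in range(depth):
        for _ in range(substeps):
            z = z + h * residual_block(z, k)
    return z

x = rng.normal(size=d)
gap = np.linalg.norm(resnet_forward(x) - ode_forward(x))
print(f"||ResNet(x) - ODE(x)|| = {gap:.2e}")    # small: one Euler step per layer
```

The "distance between the ResNet's hidden state trajectory and the solution of its corresponding Neural ODE" quantified in that paper is the kind of gap printed here; the mean-field analysis of the main paper additionally sends the width of each residual block to infinity.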
This list is automatically generated from the titles and abstracts of the papers on this site.