Do Residual Neural Networks discretize Neural Ordinary Differential
Equations?
- URL: http://arxiv.org/abs/2205.14612v1
- Date: Sun, 29 May 2022 09:29:34 GMT
- Title: Do Residual Neural Networks discretize Neural Ordinary Differential
Equations?
- Authors: Michael E. Sander, Pierre Ablin and Gabriel Peyré
- Abstract summary: We first quantify the distance between the ResNet's hidden state trajectory and the solution of its corresponding Neural ODE.
We show that this smoothness is preserved by gradient descent for a ResNet with linear residual functions and small enough initial loss.
- Score: 8.252615417740879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Ordinary Differential Equations (Neural ODEs) are the continuous
analog of Residual Neural Networks (ResNets). We investigate whether the
discrete dynamics defined by a ResNet are close to the continuous dynamics of a
Neural ODE. We first quantify the distance between the ResNet's hidden state
trajectory and the solution of its corresponding Neural ODE. Our bound is tight
and, on the negative side, does not go to 0 with depth N if the residual
functions are not smooth with depth. On the positive side, we show that this
smoothness is preserved by gradient descent for a ResNet with linear residual
functions and small enough initial loss. It ensures an implicit regularization
towards a limit Neural ODE at rate 1/N, uniformly with depth and
optimization time. As a byproduct of our analysis, we consider the use of a
memory-free discrete adjoint method to train a ResNet by recovering the
activations on the fly through a backward pass of the network, and show that
this method theoretically succeeds at large depth if the residual functions are
Lipschitz with the input. We then show that Heun's method, a second-order ODE
integration scheme, allows for better gradient estimation with the adjoint
method when the residual functions are smooth with depth. We experimentally
validate that our adjoint method succeeds at large depth, and that Heun's method
needs fewer layers to succeed. We finally use the adjoint method successfully
for fine-tuning very deep ResNets without memory consumption in the residual
layers.
Related papers
- Implicit regularization of deep residual networks towards neural ODEs [8.075122862553359]
We establish an implicit regularization of deep residual networks towards neural ODEs.
We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training.
arXiv Detail & Related papers (2023-09-03T16:35:59Z)
- The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Learning via nonlinear conjugate gradients and depth-varying neural ODEs [5.565364597145568]
The inverse problem of supervised reconstruction of depth-variable parameters in a neural ordinary differential equation (NODE) is considered.
The proposed parameter reconstruction is done for a general first order differential equation by minimizing a cost functional.
The associated sensitivity problem estimates changes in the network output under perturbation of the trained parameters.
arXiv Detail & Related papers (2022-02-11T17:00:48Z)
- On the Global Convergence of Gradient Descent for multi-layer ResNets in the mean-field regime [19.45069138853531]
First-order methods find the global optimum in the mean-field regime.
We show that if the ResNet is sufficiently large, with depth and width depending on the accuracy and confidence levels, first-order methods can find parameters that fit the data.
arXiv Detail & Related papers (2021-10-06T17:16:09Z)
- Overparameterization of deep ResNet: zero loss and mean-field analysis [19.45069138853531]
Finding parameters in a deep neural network (NN) that fit data is a nonconvex optimization problem.
We show that a basic first-order optimization method (gradient descent) finds a global solution with perfect fit in many practical situations.
We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
arXiv Detail & Related papers (2021-05-30T02:46:09Z)
- Online Limited Memory Neural-Linear Bandits with Likelihood Matching [53.18698496031658]
We study neural-linear bandits for solving problems where both exploration and representation learning play an important role.
We propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.
arXiv Detail & Related papers (2021-02-07T14:19:07Z)
- Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (ResNet) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
- On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)