Large-time asymptotics in deep learning
- URL: http://arxiv.org/abs/2008.02491v2
- Date: Mon, 29 Mar 2021 20:37:39 GMT
- Title: Large-time asymptotics in deep learning
- Authors: Carlos Esteve, Borjan Geshkovski, Dario Pighin, Enrique Zuazua
- Abstract summary: We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^2$--regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$.
In the setting of $\ell^p$--distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\left(e^{-\mu t}\right)$ in any $t\in[0,T]$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the neural ODE perspective of supervised learning and study the
impact of the final time $T$ (which may indicate the depth of a corresponding
ResNet) in training. For the classical $L^2$--regularized empirical risk
minimization problem, whenever the neural ODE dynamics are homogeneous with
respect to the parameters, we show that the training error is at most of the
order $\mathcal{O}\left(\frac{1}{T}\right)$. Furthermore, if the loss inducing
the empirical risk attains its minimum, the optimal parameters converge to
minimal $L^2$--norm parameters which interpolate the dataset. By a natural
scaling between $T$ and the regularization hyperparameter $\lambda$ we obtain
the same results when $\lambda\searrow0$ and $T$ is fixed. This allows us to
stipulate generalization properties in the overparametrized regime, now seen
from the large depth, neural ODE perspective. To enhance the polynomial decay,
inspired by turnpike theory in optimal control, we propose a learning problem
with an additional integral regularization term of the neural ODE trajectory
over $[0,T]$. In the setting of $\ell^p$--distance losses, we prove that both
the training error and the optimal parameters are at most of the order
$\mathcal{O}\left(e^{-\mu t}\right)$ in any $t\in[0,T]$. The aforementioned
stability estimates are also shown for continuous space-time neural networks,
taking the form of nonlinear integro-differential equations. By using a
time-dependent moving grid for discretizing the spatial variable, we
demonstrate that these equations provide a framework for addressing ResNets
with variable widths.
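As a concrete illustration, the two objectives described in the abstract can be written down once a discretization is fixed. The sketch below assumes a forward-Euler (ResNet-style) discretization of the neural ODE, a ReLU activation, a squared-error loss, and a linear output projection `P`; these choices, and the weight `beta` on the trajectory term, are illustrative stand-ins rather than the paper's exact formulation.

```python
import numpy as np

def sigma(z):
    # ReLU; the paper's O(1/T) result assumes dynamics homogeneous in the parameters.
    return np.maximum(z, 0.0)

def forward_euler_trajectory(X, Ws, bs, T):
    """Forward-Euler (ResNet-like) discretization of x'(t) = sigma(W(t) x + b(t))
    on [0, T], with one residual block per time step."""
    dt = T / len(Ws)
    traj, x = [X], X
    for W, b in zip(Ws, bs):
        x = x + dt * sigma(x @ W.T + b)
        traj.append(x)
    return traj

def regularized_risk(X, Y, Ws, bs, P, T, lam, beta=0.0):
    """Stand-in for the objectives sketched in the abstract:
       (1/n) * sum_i loss(P x_i(T), y_i)  +  lam * int_0^T |theta(t)|^2 dt,
    plus (when beta > 0) a turnpike-style integral of the training error
    along the whole trajectory over [0, T]."""
    dt = T / len(Ws)
    traj = forward_euler_trajectory(X, Ws, bs, T)
    # terminal empirical risk (squared-error loss as a stand-in)
    risk_T = np.mean(np.sum((traj[-1] @ P.T - Y) ** 2, axis=1))
    # L^2 regularization of the piecewise-constant parameters over [0, T]
    reg_params = dt * sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in zip(Ws, bs))
    # integral regularization of the trajectory itself (turnpike-inspired term)
    reg_traj = dt * sum(np.mean(np.sum((x @ P.T - Y) ** 2, axis=1)) for x in traj[:-1])
    return risk_T + lam * reg_params + beta * reg_traj

# Toy usage: n samples in R^d, targets in R^m, L residual blocks on [0, T].
n, d, m, L, T = 32, 5, 3, 10, 10.0
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, m))
Ws = [0.1 * rng.normal(size=(d, d)) for _ in range(L)]
bs = [np.zeros(d) for _ in range(L)]
P = rng.normal(size=(m, d))
print(regularized_risk(X, Y, Ws, bs, P, T, lam=1e-3, beta=1e-2))
```

In this notation, and under the assumptions stated in the abstract, the first result bounds the terminal training error of minimizers by $\mathcal{O}(1/T)$, while the turnpike-augmented objective (the `beta` term above) is the one for which the training error and optimal parameters decay like $\mathcal{O}(e^{-\mu t})$ along the trajectory.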
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of over-parameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Interplay between depth and width for interpolation in neural ODEs [0.0]
We examine the interplay between the width $p$ of neural ODEs and their number of layer transitions $L$.
In the high-dimensional setting, we demonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact control.
arXiv Detail & Related papers (2024-01-18T11:32:50Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
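For orientation, the classical (upper) Minkowski, or box-counting, dimension that the paper's effective variant relaxes is defined through covering numbers; the precise definition of the effective version is the paper's own and is not reproduced here.

```latex
% Upper Minkowski (box-counting) dimension of a bounded set S in R^d,
% where N(S, eps) is the minimal number of eps-balls needed to cover S:
\[
  \overline{\dim}_M(\mathcal{S})
  \;=\;
  \limsup_{\varepsilon \to 0}
  \frac{\log N(\mathcal{S}, \varepsilon)}{\log(1/\varepsilon)} .
\]
```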
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that the training loss converges to its minimum with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{\mathcal{O}}(1/n)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and
Sparsity [9.077741848403791]
We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_\ell$ of the training set.
This reformulation reveals the dynamics behind feature learning.
arXiv Detail & Related papers (2022-05-31T14:10:15Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parameterized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd\gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
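As a reminder of the approximating object mentioned above, here is a minimal kernel ridge regression sketch; a generic RBF kernel stands in for the infinite-width (NT) kernel that the paper actually uses, and all names are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Placeholder kernel; the paper works with the infinite-width NT kernel.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-2):
    """Kernel ridge regression: f(x) = k(x)^T (K + lam * n * I)^{-1} y."""
    n = X_train.shape[0]
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y_train)
    return rbf_kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
X_test = rng.normal(size=(20, 3))
print(krr_fit_predict(X, y, X_test).shape)  # (20,)
```

The 'self-induced' regularization described in the summary would appear here as an effective increase of `lam`, not as an explicit term in the code.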
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}\left(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}\right)$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured.
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.