Interplay between depth and width for interpolation in neural ODEs
- URL: http://arxiv.org/abs/2401.09902v3
- Date: Tue, 6 Feb 2024 17:05:48 GMT
- Title: Interplay between depth and width for interpolation in neural ODEs
- Authors: Antonio Álvarez-López, Arselane Hadj Slimane, Enrique Zuazua
- Abstract summary: We examine the interplay between the width $p$ of neural ODEs and their number of layer transitions $L$.
In the high-dimensional setting, we demonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact control.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural ordinary differential equations (neural ODEs) have emerged as a
natural tool for supervised learning from a control perspective, yet a complete
understanding of their optimal architecture remains elusive. In this work, we
examine the interplay between their width $p$ and number of layer transitions
$L$ (effectively the depth $L+1$). Specifically, we assess the model
expressivity in terms of its capacity to interpolate either a finite dataset
$D$ comprising $N$ pairs of points or two probability measures in
$\mathbb{R}^d$ within a Wasserstein error margin $\varepsilon>0$. Our findings
reveal a balancing trade-off between $p$ and $L$, with $L$ scaling as
$O(1+N/p)$ for dataset interpolation, and
$L=O\left(1+(p\varepsilon^d)^{-1}\right)$ for measure interpolation.
In the autonomous case, where $L=0$, a separate study is required, which we
undertake focusing on dataset interpolation. We address the relaxed problem of
$\varepsilon$-approximate controllability and establish an error decay of
$\varepsilon\sim O(\log(p)p^{-1/d})$. This decay rate is a consequence of
applying a universal approximation theorem to a custom-built Lipschitz vector
field that interpolates $D$. In the high-dimensional setting, we further
demonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact
control.
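To make the setting concrete, below is a minimal, illustrative sketch (not the authors' code) of a neural ODE of the common form $\dot{x}(t)=W(t)\,\sigma\big(A(t)x(t)+b(t)\big)$ with width $p$, whose controls are piecewise constant in time and switch $L$ times, giving $L+1$ "layers". The class name, ReLU activation, explicit Euler discretization, and initialization are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PiecewiseConstantNeuralODE(nn.Module):
    """Flow of x'(t) = W(t) relu(A(t) x(t) + b(t)) on [0, 1], with the controls
    (W, A, b) held constant on each of L + 1 sub-intervals (L layer transitions).
    Width p = number of neurons in the vector field; illustrative sketch only."""

    def __init__(self, d: int, p: int, L: int, steps_per_layer: int = 20):
        super().__init__()
        self.L, self.steps = L, steps_per_layer
        # One control triple (W, A, b) per constant-in-time interval.
        self.W = nn.Parameter(torch.randn(L + 1, d, p) / p ** 0.5)
        self.A = nn.Parameter(torch.randn(L + 1, p, d) / d ** 0.5)
        self.b = nn.Parameter(torch.zeros(L + 1, p))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dt = 1.0 / ((self.L + 1) * self.steps)   # explicit Euler step size
        for k in range(self.L + 1):              # k-th constant-control interval
            W, A, b = self.W[k], self.A[k], self.b[k]
            for _ in range(self.steps):
                x = x + dt * torch.relu(x @ A.T + b) @ W.T
        return x

# Example matching the dataset-interpolation regime: N = 64 pairs and width p = 8,
# so L = O(1 + N/p) suggests on the order of 1 + 64/8 = 9 layer transitions.
N, d, p = 64, 2, 8
model = PiecewiseConstantNeuralODE(d=d, p=p, L=1 + N // p)
y_hat = model(torch.randn(N, d))   # flow map applied to the N input points
```

Minimizing a squared loss between the flow's endpoints and the $N$ targets would be the practical counterpart of the interpolation property studied in the paper.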
Related papers
- Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that the training loss can be minimized with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations, and under these conditions we bound the test loss by $\tilde{O}(1/\ldots)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}\!\left(H\varepsilon^{-3}\ln\frac{1}{\delta}\right)$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - On the Multidimensional Random Subset Sum Problem [0.9007371440329465]
In the Random Subset Sum Problem, given $n$ i.i.d. random variables $X_1,\dots,X_n$, we wish to approximate any point $z \in [-1,1]$ as the sum of a subset $X_{i_1(z)},\dots,X_{i_s(z)}$ of them, up to error $\varepsilon\cdot\ldots$ (a toy brute-force sketch of the one-dimensional problem appears after this list).
We prove that, in $d$ dimensions, $n = O\!\left(d^3\log\frac{1}{\varepsilon}\cdot\ldots\right)$
arXiv Detail & Related papers (2022-07-28T08:10:43Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound of $\Omega(\sqrt{n/p})$ for robust interpolating neural networks with $p$ parameters on arbitrary data distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Breaking The Dimension Dependence in Sparse Distribution Estimation
under Communication Constraints [18.03695167610009]
We show that when the sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achieve an $\ell$ estimation error of $O\!\left(\frac{s}{n2^b}\right)$.
For the interactive setting, we propose a novel tree-based estimation scheme and show that the minimum sample size needed to achieve dimension-free convergence can be further reduced to $n^*(s, d, b)$.
arXiv Detail & Related papers (2021-06-16T07:52:14Z) - Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^2$-regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\!\left(\frac{1}{T}\right)$.
In the setting of $\ell^p$-distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\!\left(e^{-\mu\,\cdots}\right)$
arXiv Detail & Related papers (2020-08-06T07:33:17Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an improved algorithm with a better sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of a matroid $M$ such that $\det\left(\sum_{i \in S} v_i v_i^{\top}\right)$ is maximized.
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z)
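As referenced in the Random Subset Sum entry above, here is a toy brute-force illustration of the one-dimensional problem: approximating a target $z \in [-1,1]$ by the sum of a subset of $n$ i.i.d. samples. This is an assumption-laden sketch for intuition only; the paper's result is existential, not an algorithm, and the function below is hypothetical.

```python
import itertools
import random

def closest_subset_sum(xs, z):
    """Exhaustive search for the subset of xs whose sum is closest to z.
    Exponential in n; only meant to make the problem statement concrete."""
    best_idx, best_err = (), float("inf")
    for r in range(len(xs) + 1):
        for idx in itertools.combinations(range(len(xs)), r):
            err = abs(sum(xs[i] for i in idx) - z)
            if err < best_err:
                best_idx, best_err = idx, err
    return best_idx, best_err

random.seed(0)
n = 16
xs = [random.uniform(-1, 1) for _ in range(n)]   # i.i.d. uniform samples
idx, err = closest_subset_sum(xs, z=0.37)        # approximate z = 0.37
print(f"subset {idx} attains error {err:.5f}")
```

In one dimension, $n = O(\log\frac{1}{\varepsilon})$ samples suffice with high probability (Lueker's classical result); the multidimensional paper listed above quantifies how $n$ must grow with the dimension $d$.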