Sparse approximation in learning via neural ODEs
- URL: http://arxiv.org/abs/2102.13566v1
- Date: Fri, 26 Feb 2021 16:23:02 GMT
- Title: Sparse approximation in learning via neural ODEs
- Authors: Carlos Esteve Yagüe and Borjan Geshkovski
- Abstract summary: We study the impact of the final time horizon $T$ in training.
In practical terms, a shorter time-horizon in the training problem can be interpreted as considering a shallower residual neural network.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the continuous-time, neural ordinary differential equation
(neural ODE) perspective of deep supervised learning, and study the impact of
the final time horizon $T$ in training. We focus on a cost consisting of an
integral of the empirical risk over the time interval, and $L^1$--parameter
regularization. Under homogeneity assumptions on the dynamics (typical for ReLU
activations), we prove that any global minimizer is sparse, in the sense that
there exists a positive stopping time $T^*$ beyond which the optimal parameters
vanish. Moreover, under appropriate interpolation assumptions on the neural
ODE, we provide quantitative estimates of the stopping time $T^\ast$, and of
the training error of the trajectories at the stopping time. The latter
stipulates a quantitative approximation property of neural ODE flows with
sparse parameters. In practical terms, a shorter time-horizon in the training
problem can be interpreted as considering a shallower residual neural network
(ResNet), and since the optimal parameters are concentrated over a shorter time
horizon, such a consideration may lower the computational cost of training
without discarding relevant information.
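To make the setup concrete, here is a minimal sketch of the kind of training problem described above; the notation ($f$ for the ReLU-type dynamics, $\ell$ for the pointwise loss, $P$ for a fixed output map, $\lambda > 0$ for the regularization weight, and data pairs $(x_i^0, y_i)$) is assumed for illustration and need not match the paper's exact formulation:
$$
\min_{\theta} \; \int_0^T \frac{1}{n} \sum_{i=1}^{n} \ell\big(P\, x_i(t),\, y_i\big)\, \mathrm{d}t \;+\; \lambda \int_0^T \|\theta(t)\|_1 \, \mathrm{d}t,
\qquad \text{subject to} \quad \dot{x}_i(t) = f\big(\theta(t), x_i(t)\big), \quad x_i(0) = x_i^0 .
$$
In this notation, the sparsity result states that any global minimizer $\theta^*$ vanishes for almost every $t \ge T^*$, so only the portion of the flow on $[0, T^*]$ carries nontrivial parameters.
The ResNet interpretation can likewise be sketched via a forward-Euler discretization, in which each time step becomes one residual block and the horizon $T$ sets the depth. The toy snippet below (NumPy) illustrates this correspondence; the parametrization $\dot{x} = W(t)\,\mathrm{relu}(x) + b(t)$ and all names in the code are assumptions made for the sketch, not taken from the paper.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def resnet_forward(x0, weights, biases, dt):
        # Forward-Euler discretization of x'(t) = W(t) relu(x(t)) + b(t):
        # each time step of size dt is one residual block, so a horizon T
        # corresponds to roughly T / dt layers.
        x = x0
        for W, b in zip(weights, biases):
            x = x + dt * (W @ relu(x) + b)
        return x

    # Toy setup: horizon T discretized into L = T / dt residual blocks.
    T, dt, d = 2.0, 0.1, 4
    L = int(round(T / dt))
    rng = np.random.default_rng(0)
    weights = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]
    biases = [np.zeros(d) for _ in range(L)]

    x0 = rng.standard_normal(d)
    xT = resnet_forward(x0, weights, biases, dt)

    # Discrete analogue of the L1 term lambda * int_0^T ||theta(t)||_1 dt above.
    lam = 1e-2
    l1_penalty = lam * dt * sum(np.abs(W).sum() + np.abs(b).sum()
                                for W, b in zip(weights, biases))

A shorter horizon $T$ shrinks the number of residual blocks $L$ in this discretization, which is the practical sense in which the sparsity of the optimal parameters can lower the training cost.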
Related papers
- On Regularization via Early Stopping for Least Squares Regression [4.159762735751163]
We prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules.
We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
arXiv Detail & Related papers (2024-06-06T18:10:51Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- Learning Lipschitz Functions by GD-trained Shallow Overparameterized ReLU Neural Networks [12.018422134251384]
We show that neural networks trained to nearly zero training error are inconsistent in this class.
We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve minimax optimal rate.
arXiv Detail & Related papers (2022-12-28T14:56:27Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- DeepBayes -- an estimator for parameter estimation in stochastic nonlinear dynamical models [11.917949887615567]
We propose DeepBayes estimators that leverage the power of deep recurrent neural networks in learning an estimator.
The deep recurrent neural network architectures can be trained offline and ensure significant time savings during inference.
We demonstrate the applicability of our proposed method on different example models and perform detailed comparisons with state-of-the-art approaches.
arXiv Detail & Related papers (2022-05-04T18:12:17Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^2$--regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$.
In the setting of $\ell^p$--distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\left(e^{-\mu t}\right)$.
arXiv Detail & Related papers (2020-08-06T07:33:17Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad non-neural class of functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.