Memorization and Optimization in Deep Neural Networks with Minimum
Over-parameterization
- URL: http://arxiv.org/abs/2205.10217v3
- Date: Sun, 21 May 2023 07:02:36 GMT
- Title: Memorization and Optimization in Deep Neural Networks with Minimum
Over-parameterization
- Authors: Simone Bombari, Mohammad Hossein Amani, Marco Mondelli
- Abstract summary: The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks.
We show that the NTK is well conditioned in a challenging sub-linear setup.
Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks.
- Score: 14.186776881154127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide
memorization, optimization and generalization guarantees in deep neural
networks. A line of work has studied the NTK spectrum for two-layer and deep
networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number
of training samples. Furthermore, there is increasing evidence suggesting that
deep networks with sub-linear layer widths are powerful memorizers and
optimizers, as long as the number of parameters exceeds the number of samples.
Thus, a natural open question is whether the NTK is well conditioned in such a
challenging sub-linear setup. In this paper, we answer this question in the
affirmative. Our key technical contribution is a lower bound on the smallest
NTK eigenvalue for deep networks with the minimum possible
over-parameterization: the number of parameters is roughly $\Omega(N)$ and,
hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase
the applicability of our NTK bounds, we provide two results concerning
memorization capacity and optimization guarantees for gradient descent
training.
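As a concrete illustration of the quantity the abstract refers to, the sketch below (an assumed setup, not the authors' code; the architecture, widths, and data are arbitrary choices made for the example) estimates the smallest eigenvalue of the empirical NTK Gram matrix $K = JJ^\top$, where the rows of $J$ are the per-sample gradients of the network output with respect to all parameters.
```python
# Minimal sketch (assumed setup, not the authors' code): estimate the smallest
# eigenvalue of the empirical NTK Gram matrix K = J J^T for a small ReLU MLP.
# Architecture, widths, and data below are illustrative choices only.
import jax
import jax.numpy as jnp

def init_params(key, widths):
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        # 1/sqrt(fan_in) scaling at initialization (illustrative choice)
        params.append(jax.random.normal(sub, (d_out, d_in)) / jnp.sqrt(d_in))
    return params

def forward(params, x):
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(W @ h)
    return (params[-1] @ h)[0]  # scalar network output

def ntk_smallest_eigenvalue(params, X):
    # Row i of J is the gradient of the output on sample i w.r.t. all parameters.
    def flat_grad(x):
        grads = jax.grad(forward)(params, x)
        return jnp.concatenate([g.ravel() for g in grads])
    J = jax.vmap(flat_grad)(X)        # shape (N, number of parameters)
    K = J @ J.T                       # empirical NTK Gram matrix, shape (N, N)
    return jnp.linalg.eigvalsh(K)[0]  # eigenvalues are returned in ascending order

key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
N, d = 64, 32
X = jax.random.normal(key_x, (N, d))
params = init_params(key_w, [d, 128, 128, 1])
print(ntk_smallest_eigenvalue(params, X))
```
The paper lower-bounds this eigenvalue for deep networks whose total parameter count is roughly $\Omega(N)$, i.e. with layer widths as small as $\Omega(\sqrt{N})$; a well-conditioned $K$ is what drives the memorization and gradient-descent guarantees stated above.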
Related papers
- Efficient SGD Neural Network Training via Sublinear Activated Neuron
Identification [22.361338848134025]
We present a fully connected two-layer neural network with shifted ReLU activation that enables activated neuron identification in sublinear time via geometric search.
We also prove that our algorithm can converge in $O(M^2/\epsilon^2)$ time with network size quadratic in the coefficient norm upper bound $M$ and error term $\epsilon$.
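For context, the sketch below illustrates what "activated neuron identification" means for a shifted ReLU layer: the naive scan shown costs time linear in the number of neurons, whereas the paper's geometric search achieves sublinear time. This is an assumed illustration, not the paper's algorithm.
```python
# Minimal sketch (assumed illustration, not the paper's algorithm): for a two-layer
# network with shifted ReLU, sigma_b(z) = max(z - b, 0), the "activated" neurons on
# input x are those with <w_i, x> > b. The scan below costs O(m d) per input.
import jax
import jax.numpy as jnp

def activated_neurons(W, b, x):
    # W: (m, d) first-layer weights, b: scalar shift, x: (d,) input.
    pre_act = W @ x                      # pre-activations <w_i, x>
    return jnp.nonzero(pre_act > b)[0]   # indices of firing neurons

key = jax.random.PRNGKey(0)
m, d = 1024, 64
W = jax.random.normal(key, (m, d)) / jnp.sqrt(d)
x = jax.random.normal(jax.random.PRNGKey(1), (d,))
idx = activated_neurons(W, b=0.5, x=x)
print(idx.shape)  # with a positive shift, only a sparse subset of neurons is active
```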
arXiv Detail & Related papers (2023-07-13T05:33:44Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials [52.11466135206223]
We show that a wide two-layer neural network can jointly use the Neural Tangent Kernel (NTK) and the QuadNTK to fit target functions.
This yields an end-to-end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
arXiv Detail & Related papers (2022-06-08T06:06:51Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions, while achieving comparable error bounds both in theory and practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks [21.13299067136635]
We provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU networks.
In the finite-width setting, the network architectures we consider are quite general.
arXiv Detail & Related papers (2020-12-21T19:32:17Z)
- Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z)
- Provable Memorization via Deep Neural Networks using Sub-linear Parameters [91.0268925267129]
It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs.
By exploiting depth, we show that $O(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points.
arXiv Detail & Related papers (2020-10-26T06:19:38Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH) [2.50686294157537]
Gradient descent yields zero training loss in polynomial time for deep neural networks despite the non-convex nature of the objective function.
In this paper, we study the training dynamics of the NTK for finite-width Deep Residual Networks (ResNet) using the Neural Tangent Hierarchy (NTH).
Our analysis strongly suggests that the particular skip-connection structure of ResNet is the main reason for its success.
arXiv Detail & Related papers (2020-07-07T18:08:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.