Memorization and Optimization in Deep Neural Networks with Minimum
Over-parameterization
- URL: http://arxiv.org/abs/2205.10217v3
- Date: Sun, 21 May 2023 07:02:36 GMT
- Title: Memorization and Optimization in Deep Neural Networks with Minimum
Over-parameterization
- Authors: Simone Bombari, Mohammad Hossein Amani, Marco Mondelli
- Abstract summary: The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks.
We show that the NTK is well conditioned in a challenging sub-linear setup.
Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks.
- Score: 14.186776881154127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide
memorization, optimization and generalization guarantees in deep neural
networks. A line of work has studied the NTK spectrum for two-layer and deep
networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number
of training samples. Furthermore, there is increasing evidence suggesting that
deep networks with sub-linear layer widths are powerful memorizers and
optimizers, as long as the number of parameters exceeds the number of samples.
Thus, a natural open question is whether the NTK is well conditioned in such a
challenging sub-linear setup. In this paper, we answer this question in the
affirmative. Our key technical contribution is a lower bound on the smallest
NTK eigenvalue for deep networks with the minimum possible
over-parameterization: the number of parameters is roughly $\Omega(N)$ and,
hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase
the applicability of our NTK bounds, we provide two results concerning
memorization capacity and optimization guarantees for gradient descent
training.
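As a concrete illustration of the quantity the abstract refers to, the sketch below (an assumed setup, not the authors' code; the architecture, widths, and data are arbitrary choices made for the example) estimates the smallest eigenvalue of the empirical NTK Gram matrix $K = JJ^\top$, where the rows of $J$ are the per-sample gradients of the network output with respect to all parameters.
```python
# Minimal sketch (assumed setup, not the authors' code): estimate the smallest
# eigenvalue of the empirical NTK Gram matrix K = J J^T for a small ReLU MLP.
# Architecture, widths, and data below are illustrative choices only.
import jax
import jax.numpy as jnp

def init_params(key, widths):
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        # 1/sqrt(fan_in) scaling at initialization (illustrative choice)
        params.append(jax.random.normal(sub, (d_out, d_in)) / jnp.sqrt(d_in))
    return params

def forward(params, x):
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(W @ h)
    return (params[-1] @ h)[0]  # scalar network output

def ntk_smallest_eigenvalue(params, X):
    # Row i of J is the gradient of the output on sample i w.r.t. all parameters.
    def flat_grad(x):
        grads = jax.grad(forward)(params, x)
        return jnp.concatenate([g.ravel() for g in grads])
    J = jax.vmap(flat_grad)(X)        # shape (N, number of parameters)
    K = J @ J.T                       # empirical NTK Gram matrix, shape (N, N)
    return jnp.linalg.eigvalsh(K)[0]  # eigenvalues are returned in ascending order

key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
N, d = 64, 32
X = jax.random.normal(key_x, (N, d))
params = init_params(key_w, [d, 128, 128, 1])
print(ntk_smallest_eigenvalue(params, X))
```
The paper lower-bounds this eigenvalue for deep networks whose total parameter count is roughly $\Omega(N)$, i.e. with layer widths as small as $\Omega(\sqrt{N})$; a well-conditioned $K$ is what drives the memorization and gradient-descent guarantees stated above.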
Related papers
- Efficient SGD Neural Network Training via Sublinear Activated Neuron
Identification [22.361338848134025]
We present a fully connected two-layer neural network with shifted ReLU activation that enables activated neuron identification in sublinear time via geometric search.
We also prove that our algorithm can converge in $O(M^2/\epsilon^2)$ time with network size quadratic in the coefficient norm upper bound $M$ and error term $\epsilon$.
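For context, the sketch below illustrates what "activated neuron identification" means for a shifted ReLU layer: the naive scan shown costs time linear in the number of neurons, whereas the paper's geometric search achieves sublinear time. This is an assumed illustration, not the paper's algorithm.
```python
# Minimal sketch (assumed illustration, not the paper's algorithm): for a two-layer
# network with shifted ReLU, sigma_b(z) = max(z - b, 0), the "activated" neurons on
# input x are those with <w_i, x> > b. The scan below costs O(m d) per input.
import jax
import jax.numpy as jnp

def activated_neurons(W, b, x):
    # W: (m, d) first-layer weights, b: scalar shift, x: (d,) input.
    pre_act = W @ x                      # pre-activations <w_i, x>
    return jnp.nonzero(pre_act > b)[0]   # indices of firing neurons

key = jax.random.PRNGKey(0)
m, d = 1024, 64
W = jax.random.normal(key, (m, d)) / jnp.sqrt(d)
x = jax.random.normal(jax.random.PRNGKey(1), (d,))
idx = activated_neurons(W, b=0.5, x=x)
print(idx.shape)  # with a positive shift, only a sparse subset of neurons is active
```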
arXiv Detail & Related papers (2023-07-13T05:33:44Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials [52.11466135206223]
We show that a wide two-layer neural network can jointly use the Neural Tangent Kernel (NTK) and the QuadNTK to fit target functions.
This yields an end-to-end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
arXiv Detail & Related papers (2022-06-08T06:06:51Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions, while achieving comparable error bounds both in theory and practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks [21.13299067136635]
We provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU networks.
In the finite-width setting, the network architectures we consider are quite general.
arXiv Detail & Related papers (2020-12-21T19:32:17Z)
- Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z)
- Provable Memorization via Deep Neural Networks using Sub-linear Parameters [91.0268925267129]
It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs.
By exploiting depth, we show that $O(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points.
arXiv Detail & Related papers (2020-10-26T06:19:38Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH) [2.50686294157537]
Gradient descent yields zero training loss in polynomial time for deep neural networks despite the non-convex nature of the objective function.
In this paper, we study the training dynamics of the NTK for finite-width Deep Residual Networks (ResNet) using the Neural Tangent Hierarchy (NTH).
Our analysis strongly suggests that the particular skip-connection structure of ResNet is the main reason for its success.
arXiv Detail & Related papers (2020-07-07T18:08:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.