Identifying good directions to escape the NTK regime and efficiently
learn low-degree plus sparse polynomials
- URL: http://arxiv.org/abs/2206.03688v1
- Date: Wed, 8 Jun 2022 06:06:51 GMT
- Title: Identifying good directions to escape the NTK regime and efficiently
learn low-degree plus sparse polynomials
- Authors: Eshaan Nichani, Yu Bai, Jason D. Lee
- Abstract summary: We show that a wide two-layer neural network can jointly use the Neural Tangent Kernel (NTK) and the QuadNTK to fit target functions.
This yields an end-to-end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
- Score: 52.11466135206223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent goal in the theory of deep learning is to identify how neural
networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime,
where the network is coupled with its first-order Taylor expansion at
initialization. While the NTK is minimax optimal for learning dense polynomials
(Ghorbani et al., 2021), it cannot learn features, and hence has poor sample
complexity for learning many classes of functions including sparse polynomials.
Recent works have thus aimed to identify settings where gradient based
algorithms provably generalize better than the NTK. One such example is the
"QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term
in the Taylor expansion. Bai and Lee (2020) show that the second-order term can
learn sparse polynomials efficiently; however, it sacrifices the ability to
learn general dense polynomials.
In this paper, we analyze how gradient descent on a two-layer neural network
can escape the NTK regime by utilizing a spectral characterization of the NTK
(Montanari and Zhong, 2020) and building on the QuadNTK approach. We first
expand upon the spectral analysis to identify "good" directions in parameter
space in which we can move without harming generalization. Next, we show that a
wide two-layer neural network can jointly use the NTK and QuadNTK to fit target
functions consisting of a dense low-degree term and a sparse high-degree term
-- something neither the NTK nor the QuadNTK can do on their own. Finally, we
construct a regularizer which encourages our parameter vector to move in the
"good" directions, and show that gradient descent on the regularized loss will
converge to a global minimizer, which also has low test error. This yields an
end-to-end convergence and generalization guarantee with provable sample
complexity improvement over both the NTK and QuadNTK on their own.
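As a concrete (and heavily simplified) illustration of the decomposition the abstract refers to, the sketch below expands a two-layer network around its initialization: the first-order term plays the role of the NTK model and the second-order term plays the role of the QuadNTK model. The sizes, the tanh activation, and the fixed second layer are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200                                      # input dimension and width (illustrative)
x = rng.normal(size=d) / np.sqrt(d)                # a single input
W0 = rng.normal(size=(m, d))                       # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed second layer

act   = np.tanh                                    # smooth activation so the quadratic term exists
dact  = lambda z: 1.0 - np.tanh(z) ** 2
ddact = lambda z: -2.0 * np.tanh(z) * (1.0 - np.tanh(z) ** 2)

def f(W):
    """Two-layer network output for input x."""
    return a @ act(W @ x)

dW = 0.1 * rng.normal(size=(m, d))                 # a small move away from initialization
z0 = W0 @ x
ntk_term  = a @ (dact(z0) * (dW @ x))              # first-order term: linear in dW ("NTK")
quad_term = 0.5 * a @ (ddact(z0) * (dW @ x) ** 2)  # second-order term ("QuadNTK")

print(f(W0 + dW) - f(W0))                          # exact change in the network output
print(ntk_term + quad_term)                        # NTK + QuadNTK approximation (close for small dW)
```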
Related papers
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
We consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variant in this paper.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the kind of spectral bias widely studied in the deep learning literature, the trained network attains a particularly sharp generalization bound with the minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
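As a rough illustration of the preconditioned-gradient-descent idea (on plain least squares, with a hypothetical diagonal preconditioner rather than the paper's spectrally motivated one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Hypothetical diagonal preconditioner (inverse per-feature second moments);
# the paper's preconditioner is constructed from its spectral analysis instead.
P = np.diag(1.0 / np.mean(X ** 2, axis=0))

theta, eta, T = np.zeros(d), 0.1, 200          # T plays the role of early stopping
for t in range(T):
    grad = X.T @ (X @ theta - y) / n           # least-squares gradient
    theta -= eta * P @ grad                    # preconditioned update
print(0.5 * np.mean((X @ theta - y) ** 2))     # training error after T steps
```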
arXiv Detail & Related papers (2024-07-16T03:38:34Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
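The limiting kernel itself is derived in that paper; purely as an illustration of the objects involved, the sketch below computes the empirical NTK Gram matrix of a one-hidden-layer ReLU network whose hidden units carry a large bias (all sizes and the bias scale B are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, B = 8, 5, 2000, 2.0                     # samples, input dim, width, bias scale
X = rng.normal(size=(n, d)) / np.sqrt(d)
W = rng.normal(size=(m, d))
b = B * rng.normal(size=m)                       # large biases make activations sparse
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

# f(x) = a . relu(W x + b); gradient features w.r.t. (W, b) are
# (a_j 1{w_j.x + b_j > 0} x, a_j 1{w_j.x + b_j > 0}) for each unit j.
pre = X @ W.T + b                                # (n, m) pre-activations
ind = (pre > 0).astype(float)                    # activation pattern
K = ((ind * a ** 2) @ ind.T) * (X @ X.T + 1.0)   # empirical NTK Gram matrix
print(K.shape, ind.mean())                       # kernel size and activation sparsity
```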
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization [14.186776881154127]
The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks.
We show that the NTK is well conditioned in a challenging sub-linear setup.
Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks.
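For a one-hidden-layer network (not the deep setting the paper actually analyzes), the quantity in question can be checked numerically as follows; the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 20, 10, 500                          # modest width relative to n*d ("sub-linear" spirit)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

ind = (X @ W.T > 0).astype(float)              # ReLU activation patterns, (n, m)
K = ((ind * a ** 2) @ ind.T) * (X @ X.T)       # empirical NTK w.r.t. first-layer weights
print(np.linalg.eigvalsh(K).min())             # smallest NTK eigenvalue
```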
arXiv Detail & Related papers (2022-05-20T14:50:24Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
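A tiny discretized gradient-flow experiment in the spirit of that claim (input dimension at least the number of samples; the tanh activation, a frozen second layer, and all sizes are simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 16, 64                              # input dimension >= number of samples
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # second layer held fixed for simplicity

def loss(W):
    return 0.5 * np.mean((np.tanh(X @ W.T) @ a - y) ** 2)

eta = 0.5                                        # small constant step: crude Euler discretization of GF
for t in range(3001):
    pre = X @ W.T
    resid = (np.tanh(pre) @ a - y) / n                                 # scaled residuals, (n,)
    grad_W = ((resid[:, None] * (1.0 - np.tanh(pre) ** 2)) * a).T @ X  # dL/dW, (m, d)
    W -= eta * grad_W
    if t % 1000 == 0:
        print(t, loss(W))                        # loss should keep shrinking (linear rate under exact GF)
```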
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
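The sketching construction itself is the paper's contribution; the building blocks it approximates are the arc-cosine kernels, which for unit-norm inputs have the closed forms computed below (a reference implementation, not the fast sketch):

```python
import numpy as np

def arc_cosine_kernels(X):
    """Order-0 and order-1 arc-cosine kernel Gram matrices for unit-norm rows of X.
    These are the ingredients of the (C)NTK expansions that the paper sketches."""
    cos_t = np.clip(X @ X.T, -1.0, 1.0)
    theta = np.arccos(cos_t)
    k0 = 1.0 - theta / np.pi                                   # order 0
    k1 = (np.sin(theta) + (np.pi - theta) * cos_t) / np.pi     # order 1
    return k0, k1

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
k0, k1 = arc_cosine_kernels(X)                 # exact kernels; the paper approximates these
print(k0.shape, k1.shape)
```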
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions that achieve comparable error bounds, both in theory and in practice.
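One simple (though not dimension-efficient) feature map with this property stacks the gradient features of a finite-width ReLU network; its inner products converge to the two-layer ReLU NTK as the width m grows. This is a baseline-style sketch, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 8, 2000                          # illustrative sizes; feature dimension is m*(d+1)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

# phi(x) = m^{-1/2} [ relu(w_j.x)          (second-layer gradients),
#                     a_j 1{w_j.x > 0} x   (first-layer gradients) ]_{j=1..m}
pre = X @ W.T
phi = np.hstack([np.maximum(pre, 0.0),
                 (a * (pre > 0)).repeat(d, axis=1) * np.tile(X, m)]) / np.sqrt(m)

# Infinite-width two-layer ReLU NTK for unit-norm inputs, for comparison.
cos_t = np.clip(X @ X.T, -1.0, 1.0)
theta = np.arccos(cos_t)
K_ntk = cos_t * (np.pi - theta) / (2 * np.pi) \
      + (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
print(np.abs(phi @ phi.T - K_ntk).max())       # approximation error, shrinking like m**-0.5
```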
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
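The averaging scheme in question is Polyak-Ruppert iterate averaging; a minimal sketch on a linear model (standing in for the NTK feature model, with made-up data and step size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d)) / np.sqrt(d)       # rows play the role of (N)TK features
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
theta_bar = np.zeros(d)
eta = 0.5                                      # constant step size, as in the averaged-SGD analyses
for t in range(1, 5001):
    i = rng.integers(n)                        # single-sample stochastic gradient
    grad = (X[i] @ theta - y[i]) * X[i]
    theta -= eta * grad
    theta_bar += (theta - theta_bar) / t       # running Polyak-Ruppert average
print(np.mean((X @ theta_bar - y) ** 2))       # error of the averaged iterate
```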
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.