Gradient Descent in Neural Networks as Sequential Learning in RKBS
- URL: http://arxiv.org/abs/2302.00205v1
- Date: Wed, 1 Feb 2023 03:18:07 GMT
- Title: Gradient Descent in Neural Networks as Sequential Learning in RKBS
- Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh
- Abstract summary: We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
- Score: 63.011641517977644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The study of Neural Tangent Kernels (NTKs) has provided much needed insight
into convergence and generalization properties of neural networks in the
over-parametrized (wide) limit by approximating the network using a first-order
Taylor expansion with respect to its weights in the neighborhood of their
initialization values. This allows neural network training to be analyzed from
the perspective of reproducing kernel Hilbert spaces (RKHS), which is
informative in the over-parametrized regime, but a poor approximation for
narrower networks as the weights change more during training. Our goal is to
extend beyond the limits of NTK toward a more general theory. We construct an
exact power-series representation of the neural network in a finite
neighborhood of the initial weights as an inner product of two feature maps,
respectively from data and weight-step space, to feature space, allowing neural
network training to be analyzed from the perspective of reproducing kernel
Banach space (RKBS). We prove that, regardless of width, the training sequence
produced by gradient descent can be exactly replicated by regularized
sequential learning in RKBS. Using this, we present a novel bound on uniform
convergence in which the iteration count and learning rate play a central role,
giving new theoretical insight into neural network training.
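To make the NTK baseline that the abstract contrasts against concrete, the following is a minimal JAX sketch (the tanh network, widths, and synthetic data are assumptions for the example, not taken from the paper) that forms the first-order Taylor expansion of a small network around its initial weights and checks how far the linearization drifts from the true network after one gradient-descent step.

```python
# Minimal sketch (illustrative, not the paper's construction): build the
# first-order Taylor expansion of a small network around its initial weights,
#   f_lin(x; w) = f(x; w0) + <grad_w f(x; w0), w - w0>,
# then check how well it tracks the true network after one gradient-descent step.
import jax
import jax.numpy as jnp

def mlp(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
width = 512  # the linearization tracks the network better as width grows
params0 = (jax.random.normal(k1, (2, width)) / jnp.sqrt(2.0),
           jnp.zeros(width),
           jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
           jnp.zeros(1))

x = jax.random.normal(k3, (8, 2))
y = jnp.sin(x[:, :1])

# jax.linearize returns f(x; w0) and the map dw -> <grad_w f(x; w0), dw>
f0, jvp_fn = jax.linearize(lambda p: mlp(p, x), params0)

loss = lambda p: jnp.mean((mlp(p, x) - y) ** 2)
grads = jax.grad(loss)(params0)

lr = 0.1
params1 = jax.tree_util.tree_map(lambda p, g: p - lr * g, params0, grads)
dw = jax.tree_util.tree_map(lambda a, b: a - b, params1, params0)

pred_true = mlp(params1, x)   # network after one gradient-descent step
pred_lin = f0 + jvp_fn(dw)    # NTK-style linearized prediction
print("max |true - linearized| after one step:",
      jnp.max(jnp.abs(pred_true - pred_lin)))
```

For large width and a small step the discrepancy is tiny, which is the over-parametrized regime where the RKHS/NTK view is accurate; for narrower networks or longer training the gap grows, which is the regime the exact power-series/RKBS construction above is meant to cover.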
Related papers
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural-networks and their training applicable to neural networks of arbitrary width, depth and topology.
We also present a novel, exact representor theory for layer-wise neural network training with unregularized gradient descent, stated in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z)
- How many Neurons do we need? A refined Analysis for Shallow Networks trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z)
- Connecting NTK and NNGP: A Unified Theoretical Framework for Neural Network Learning Dynamics in the Kernel Regime [7.136205674624813]
We provide a comprehensive framework for understanding the learning process of deep neural networks in the infinite width limit.
We identify two learning phases characterized by different time scales: gradient-driven and diffusive learning.
arXiv Detail & Related papers (2023-09-08T18:00:01Z)
- Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, under plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training neural networks for their fast convergence.
We show that, with random initialization, Nesterov's accelerated gradient can converge to the global minimum at a rate governed by the condition number $\kappa$.
We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay; a generic sketch of such an update appears after this list.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
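As referenced in the last entry above, here is a minimal sketch of one plausible reading of a noisy gradient-descent step with weight decay on a two-layer ReLU network; the parametrization, learning rate, noise scale, and data are illustrative assumptions, not that paper's exact algorithm or scaling.

```python
# Minimal sketch: noisy gradient descent with weight decay on a two-layer ReLU
# network. Learning rate, noise scale, and parametrization are assumptions for
# illustration, not the cited paper's exact algorithm.
import jax
import jax.numpy as jnp

def two_layer_relu(params, x):
    W1, W2 = params
    return jax.nn.relu(x @ W1) @ W2

def mse_loss(params, x, y):
    return jnp.mean((two_layer_relu(params, x) - y) ** 2)

def noisy_gd_step(params, x, y, key, lr=1e-2, weight_decay=1e-4, noise_scale=1e-3):
    grads = jax.grad(mse_loss)(params, x, y)   # same tuple structure as params
    keys = jax.random.split(key, len(params))
    new_params = []
    for p, g, k in zip(params, grads, keys):
        noise = noise_scale * jax.random.normal(k, p.shape)
        # Weight decay adds an explicit L2 term; the injected Gaussian noise is
        # what makes the dynamics "noisy" rather than plain gradient descent.
        new_params.append(p - lr * (g + weight_decay * p) + noise)
    return tuple(new_params)

# Usage with synthetic data and sizes chosen only for the example:
key = jax.random.PRNGKey(1)
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (4, 256)) / 2.0,
          jax.random.normal(k2, (256, 1)) / 16.0)
x = jax.random.normal(k3, (32, 4))
y = jnp.sum(x, axis=1, keepdims=True)
for _ in range(100):
    key, sub = jax.random.split(key)
    params = noisy_gd_step(params, x, y, sub)
print("training loss:", mse_loss(params, x, y))
```

The point of the update is that the weight-decay term acts as explicit regularization while the noise perturbs the trajectory; per the cited analysis, such dynamics can still behave in a kernel-like way up to a certain accuracy.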