Gradient Descent in Neural Networks as Sequential Learning in RKBS
- URL: http://arxiv.org/abs/2302.00205v1
- Date: Wed, 1 Feb 2023 03:18:07 GMT
- Title: Gradient Descent in Neural Networks as Sequential Learning in RKBS
- Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh
- Abstract summary: We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
- Score: 63.011641517977644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The study of Neural Tangent Kernels (NTKs) has provided much needed insight
into convergence and generalization properties of neural networks in the
over-parametrized (wide) limit by approximating the network using a first-order
Taylor expansion with respect to its weights in the neighborhood of their
initialization values. This allows neural network training to be analyzed from
the perspective of reproducing kernel Hilbert spaces (RKHS), which is
informative in the over-parametrized regime, but a poor approximation for
narrower networks as the weights change more during training. Our goal is to
extend beyond the limits of NTK toward a more general theory. We construct an
exact power-series representation of the neural network in a finite
neighborhood of the initial weights as an inner product of two feature maps,
respectively from data and weight-step space, to feature space, allowing neural
network training to be analyzed from the perspective of reproducing kernel
Banach space (RKBS). We prove that, regardless of width, the training sequence
produced by gradient descent can be exactly replicated by regularized
sequential learning in RKBS. Using this, we present a novel bound on uniform
convergence in which the iteration count and learning rate play a central role,
giving new theoretical insight into neural network training.
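To make the NTK baseline that the abstract contrasts against concrete, the following is a minimal JAX sketch (the tanh network, widths, and synthetic data are assumptions for the example, not taken from the paper) that forms the first-order Taylor expansion of a small network around its initial weights and checks how far the linearization drifts from the true network after one gradient-descent step.

```python
# Minimal sketch (illustrative, not the paper's construction): build the
# first-order Taylor expansion of a small network around its initial weights,
#   f_lin(x; w) = f(x; w0) + <grad_w f(x; w0), w - w0>,
# then check how well it tracks the true network after one gradient-descent step.
import jax
import jax.numpy as jnp

def mlp(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
width = 512  # the linearization tracks the network better as width grows
params0 = (jax.random.normal(k1, (2, width)) / jnp.sqrt(2.0),
           jnp.zeros(width),
           jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
           jnp.zeros(1))

x = jax.random.normal(k3, (8, 2))
y = jnp.sin(x[:, :1])

# jax.linearize returns f(x; w0) and the map dw -> <grad_w f(x; w0), dw>
f0, jvp_fn = jax.linearize(lambda p: mlp(p, x), params0)

loss = lambda p: jnp.mean((mlp(p, x) - y) ** 2)
grads = jax.grad(loss)(params0)

lr = 0.1
params1 = jax.tree_util.tree_map(lambda p, g: p - lr * g, params0, grads)
dw = jax.tree_util.tree_map(lambda a, b: a - b, params1, params0)

pred_true = mlp(params1, x)   # network after one gradient-descent step
pred_lin = f0 + jvp_fn(dw)    # NTK-style linearized prediction
print("max |true - linearized| after one step:",
      jnp.max(jnp.abs(pred_true - pred_lin)))
```

For large width and a small step the discrepancy is tiny, which is the over-parametrized regime where the RKHS/NTK view is accurate; for narrower networks or longer training the gap grows, which is the regime the exact power-series/RKBS construction above is meant to cover.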
Related papers
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural-networks and their training applicable to neural networks of arbitrary width, depth and topology.
We also present a novel, exact representor theory for layer-wise neural network training with unregularized gradient descent, stated in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z)
- How many Neurons do we need? A refined Analysis for Shallow Networks trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z)
- Connecting NTK and NNGP: A Unified Theoretical Framework for Neural Network Learning Dynamics in the Kernel Regime [7.136205674624813]
We provide a comprehensive framework for understanding the learning process of deep neural networks in the infinite width limit.
We identify two learning phases characterized by different time scales: gradient-driven and diffusive learning.
arXiv Detail & Related papers (2023-09-08T18:00:01Z)
- Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, under plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training neural networks for their fast convergence.
We show that, with random initialization, Nesterov's accelerated gradient can converge to the global minimum at a rate governed by the condition number $\kappa$.
We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay; a generic sketch of such an update appears after this list.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
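As referenced in the last entry above, here is a minimal sketch of one plausible reading of a noisy gradient-descent step with weight decay on a two-layer ReLU network; the parametrization, learning rate, noise scale, and data are illustrative assumptions, not that paper's exact algorithm or scaling.

```python
# Minimal sketch: noisy gradient descent with weight decay on a two-layer ReLU
# network. Learning rate, noise scale, and parametrization are assumptions for
# illustration, not the cited paper's exact algorithm.
import jax
import jax.numpy as jnp

def two_layer_relu(params, x):
    W1, W2 = params
    return jax.nn.relu(x @ W1) @ W2

def mse_loss(params, x, y):
    return jnp.mean((two_layer_relu(params, x) - y) ** 2)

def noisy_gd_step(params, x, y, key, lr=1e-2, weight_decay=1e-4, noise_scale=1e-3):
    grads = jax.grad(mse_loss)(params, x, y)   # same tuple structure as params
    keys = jax.random.split(key, len(params))
    new_params = []
    for p, g, k in zip(params, grads, keys):
        noise = noise_scale * jax.random.normal(k, p.shape)
        # Weight decay adds an explicit L2 term; the injected Gaussian noise is
        # what makes the dynamics "noisy" rather than plain gradient descent.
        new_params.append(p - lr * (g + weight_decay * p) + noise)
    return tuple(new_params)

# Usage with synthetic data and sizes chosen only for the example:
key = jax.random.PRNGKey(1)
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (4, 256)) / 2.0,
          jax.random.normal(k2, (256, 1)) / 16.0)
x = jax.random.normal(k3, (32, 4))
y = jnp.sum(x, axis=1, keepdims=True)
for _ in range(100):
    key, sub = jax.random.split(key)
    params = noisy_gd_step(params, x, y, sub)
print("training loss:", mse_loss(params, x, y))
```

The point of the update is that the weight-decay term acts as explicit regularization while the noise perturbs the trajectory; per the cited analysis, such dynamics can still behave in a kernel-like way up to a certain accuracy.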