Understanding Sparse Feature Updates in Deep Networks using Iterative
Linearisation
- URL: http://arxiv.org/abs/2211.12345v4
- Date: Thu, 12 Oct 2023 15:44:42 GMT
- Title: Understanding Sparse Feature Updates in Deep Networks using Iterative
Linearisation
- Authors: Adrian Goldwaser and Hong Ge
- Abstract summary: We derive an iterative linearised training method as a novel empirical tool to investigate why larger and deeper networks generalise well.
We show that in a variety of cases, iterative linearised training surprisingly performs on par with standard training.
We also show that feature learning is essential for good performance.
- Score: 2.33877878310217
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Larger and deeper networks generalise well despite their increased capacity
to overfit. Understanding why this happens is theoretically and practically
important. One recent approach looks at the infinitely wide limits of such
networks and their corresponding kernels. However, these theoretical tools
cannot fully explain finite networks as the empirical kernel changes
significantly during gradient-descent-based training in contrast to infinite
networks. In this work, we derive an iterative linearised training method as a
novel empirical tool to further investigate this distinction, allowing us to
control for sparse (i.e. infrequent) feature updates and quantify the frequency
of feature learning needed to achieve comparable performance. We justify
iterative linearisation as an interpolation between a finite analog of the
infinite width regime, which does not learn features, and standard gradient
descent training, which does. Informally, we also show that it is analogous to
a damped version of the Gauss-Newton algorithm -- a second-order method. We
show that in a variety of cases, iterative linearised training surprisingly
performs on par with standard training, noting in particular how much less
frequent feature learning is required to achieve comparable performance. We
also show that feature learning is essential for good performance. Since such
feature learning inevitably causes changes in the NTK kernel, we provide direct
negative evidence for the NTK theory, which states the NTK kernel remains
constant during training.
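The method can be read as periodically re-linearising the network around its current weights and training the linearised model in between, with the re-linearisation period controlling how often features are updated. Below is a minimal JAX sketch under that reading; `model_fn`, the squared-error loss, and the `relinearise_every` schedule are illustrative assumptions, not the paper's exact setup.

```python
import jax
import jax.numpy as jnp

def linearise(model_fn, anchor_params):
    """First-order Taylor expansion of model_fn around anchor_params."""
    def lin_fn(params, x):
        # f_lin(x; params) = f(x; anchor) + J_anchor(x) (params - anchor)
        delta = jax.tree_util.tree_map(lambda p, a: p - a, params, anchor_params)
        f0, jvp_out = jax.jvp(lambda p: model_fn(p, x), (anchor_params,), (delta,))
        return f0 + jvp_out
    return lin_fn

def iterative_linearised_training(model_fn, params, x, y,
                                  lr=1e-2, steps=1000, relinearise_every=100):
    lin_fn = linearise(model_fn, params)          # features frozen at current weights
    for step in range(steps):
        if step > 0 and step % relinearise_every == 0:
            lin_fn = linearise(model_fn, params)  # sparse feature update
        loss = lambda p: jnp.mean((lin_fn(p, x) - y) ** 2)
        grads = jax.grad(loss)(params)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params
```

In this sketch, `relinearise_every = 1` approximately recovers standard gradient descent, while never re-linearising corresponds to training a fixed-feature (NTK-style) linear model, matching the interpolation described in the abstract.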
Related papers
- Efficient kernel surrogates for neural network-based regression [0.8030359871216615]
We study the performance of the Conjugate Kernel (CK), an efficient approximation to the Neural Tangent Kernel (NTK).
We show that CK performance is only marginally worse than that of the NTK and, in certain cases, superior.
In addition to providing a theoretical grounding for using CKs instead of NTKs, our framework suggests a recipe for improving DNN accuracy inexpensively.
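As a rough illustration of the distinction, the sketch below computes both empirical kernels for a small MLP; the architecture, initialisation and helper names are assumptions for illustration, not taken from the cited paper.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    """Random MLP parameters as a list of (W, b) pairs."""
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def features(params, x):
    """Penultimate-layer representation (everything except the readout)."""
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    return x

def model_fn(params, x):
    W, b = params[-1]
    return (features(params, x) @ W + b).squeeze(-1)

def conjugate_kernel(params, x):
    # CK(x_i, x_j) = <phi(x_i), phi(x_j)>, using the last-layer features only.
    phi = features(params, x)                     # shape (n, width)
    return phi @ phi.T

def empirical_ntk(params, x):
    # NTK(x_i, x_j) = <grad_theta f(x_i), grad_theta f(x_j)>, over all parameters.
    jac = jax.jacobian(model_fn)(params, x)
    flat = jnp.concatenate([j.reshape(x.shape[0], -1)
                            for j in jax.tree_util.tree_leaves(jac)], axis=1)
    return flat @ flat.T
```

The CK needs only a forward pass to the penultimate layer, whereas the empirical NTK requires per-example parameter gradients, which is the efficiency gap the entry above refers to.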
arXiv Detail & Related papers (2023-10-28T06:41:47Z) - How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z) - Rapid Feature Evolution Accelerates Learning in Neural Networks [2.538209532048867]
We analyze the phenomenon of kernel alignment of the NTK with the target functions during gradient descent.
We show that feature evolution is faster and more dramatic in deeper networks.
We also found that networks with multiple output nodes develop separate, specialized kernels for each output channel.
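A quantity commonly used to track this kind of alignment is the centred kernel-target alignment between the empirical NTK and the label vector; a minimal sketch is given below, with the centring step and naming being assumptions rather than details from the cited paper.

```python
import jax.numpy as jnp

def kernel_target_alignment(K, y):
    """A(K, yy^T) = <Kc, Yc>_F / (||Kc||_F ||Yc||_F), with centred matrices."""
    n = K.shape[0]
    H = jnp.eye(n) - jnp.ones((n, n)) / n        # centring matrix
    Kc, Yc = H @ K @ H, H @ jnp.outer(y, y) @ H
    return jnp.vdot(Kc, Yc) / (jnp.linalg.norm(Kc) * jnp.linalg.norm(Yc))
```

Recomputing K as the empirical NTK at the current weights and tracking this value over training gives a concrete measure of the feature evolution the entry describes.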
arXiv Detail & Related papers (2021-05-29T13:50:03Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, in time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - When and why PINNs fail to train: A neural tangent kernel perspective [2.1485350418225244]
We derive the Neural Tangent Kernel (NTK) of PINNs and prove that, under appropriate conditions, it converges to a deterministic kernel that stays constant during training in the infinite-width limit.
We find a remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error.
We propose a novel gradient descent algorithm that utilizes the eigenvalues of the NTK to adaptively calibrate the convergence rate of the total training error.
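One way to realise such NTK-informed balancing is to weight each loss component by the relative size of its empirical NTK block, so that slowly converging terms are upweighted; the sketch below follows that spirit, but the precise rule is an assumption and may differ from the cited paper's algorithm.

```python
import jax.numpy as jnp

def ntk_loss_weights(ntk_blocks):
    """ntk_blocks: dict mapping loss-term name -> its empirical NTK matrix."""
    traces = {name: jnp.trace(K) for name, K in ntk_blocks.items()}
    total = sum(traces.values())
    # Terms whose NTK block has a small trace (slow convergence) receive a
    # proportionally larger weight.
    return {name: total / tr for name, tr in traces.items()}
```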
arXiv Detail & Related papers (2020-07-28T23:44:56Z) - The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in deeper networks and in networks with convolutional architectures.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)