NTK-SAP: Improving neural network pruning by aligning training dynamics
- URL: http://arxiv.org/abs/2304.02840v1
- Date: Thu, 6 Apr 2023 03:10:03 GMT
- Title: NTK-SAP: Improving neural network pruning by aligning training dynamics
- Authors: Yite Wang, Dawei Li, Ruoyu Sun
- Abstract summary: Recent advances in neural tangent kernel (NTK) theory suggest that the training dynamics of large enough neural networks are closely related to the spectrum of the NTK.
We propose to prune the connections that have the least influence on the spectrum of the NTK.
We name our foresight pruning algorithm Neural Tangent Kernel Spectrum-Aware Pruning (NTK-SAP).
- Score: 13.887349224871045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pruning neural networks before training has received increasing interest due
to its potential to reduce training time and memory. One popular method is to
prune the connections based on a certain metric, but it is not entirely clear
what metric is the best choice. Recent advances in neural tangent kernel (NTK)
theory suggest that the training dynamics of large enough neural networks are
closely related to the spectrum of the NTK. Motivated by this finding, we
propose to prune the connections that have the least influence on the spectrum
of the NTK. This method can help maintain the NTK spectrum, which may help
align the training dynamics to those of its dense counterpart. However, one
possible issue is that the fixed-weight-NTK corresponding to a given initial
point can be very different from the NTK corresponding to later iterates during
the training phase. We further propose to sample multiple realizations of
random weights to estimate the NTK spectrum. Note that our approach is
weight-agnostic, which is different from most existing methods that are
weight-dependent. In addition, we use random inputs to compute the
fixed-weight-NTK, making our method data-agnostic as well. We name our
foresight pruning algorithm Neural Tangent Kernel Spectrum-Aware Pruning
(NTK-SAP). Empirically, our method achieves better performance than all
baselines on multiple datasets.
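To make the recipe above concrete, here is a minimal PyTorch sketch, assuming a toy two-layer scalar-output MLP; all names (MaskedMLP, ntk_trace, ntk_sap_scores) are hypothetical, and the code illustrates the general idea rather than the authors' released implementation. For a scalar-output network the trace of the fixed-weight NTK equals the sum over inputs of the squared parameter-gradient norm, so the sketch scores each connection by how strongly that trace depends on the connection's mask, averaged over several fresh random-weight draws and random inputs (weight- and data-agnostic, as described above).

```python
# Minimal sketch (PyTorch). Hypothetical names; illustrates the general recipe from
# the abstract, not the authors' exact NTK-SAP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMLP(nn.Module):
    """Two-layer scalar-output MLP whose weights are gated by differentiable masks."""
    def __init__(self, d_in=32, d_hidden=64):
        super().__init__()
        self.w1 = nn.Parameter(torch.empty(d_hidden, d_in))
        self.w2 = nn.Parameter(torch.empty(1, d_hidden))
        self.m1 = nn.Parameter(torch.ones(d_hidden, d_in))  # mask score per connection
        self.m2 = nn.Parameter(torch.ones(1, d_hidden))
        self.resample_weights()

    @torch.no_grad()
    def resample_weights(self):
        # Fresh random weights: scores reflect the init distribution, not one draw.
        nn.init.kaiming_normal_(self.w1)
        nn.init.kaiming_normal_(self.w2)

    def forward(self, x):
        h = F.relu(F.linear(x, self.m1 * self.w1))
        return F.linear(h, self.m2 * self.w2)

def ntk_trace(model, x):
    """For a scalar-output net, trace of the fixed-weight NTK on inputs x:
    sum_i ||d f(x_i)/d theta||^2, kept differentiable w.r.t. the masks."""
    trace = 0.0
    for xi in x:                                          # explicit loop for clarity
        out = model(xi.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, [model.w1, model.w2], create_graph=True)
        trace = trace + sum((g ** 2).sum() for g in grads)
    return trace

def ntk_sap_scores(model, n_draws=3, batch=8, d_in=32):
    """Saliency per connection: |d(NTK trace)/d(mask)|, averaged over several
    fresh random-weight draws and random (data-agnostic) inputs."""
    s1, s2 = torch.zeros_like(model.m1), torch.zeros_like(model.m2)
    for _ in range(n_draws):
        model.resample_weights()                          # multi-sample random weights
        x = torch.randn(batch, d_in)                      # random inputs, no data needed
        g1, g2 = torch.autograd.grad(ntk_trace(model, x), [model.m1, model.m2])
        s1 += g1.abs() / n_draws
        s2 += g2.abs() / n_draws
    return s1, s2

# Prune the connections whose removal changes the NTK-trace surrogate the least.
model = MaskedMLP()
s1, s2 = ntk_sap_scores(model)
keep = int(0.5 * s1.numel())                              # e.g. 50% sparsity in layer 1
thresh = torch.kthvalue(s1.flatten(), s1.numel() - keep + 1).values
model.m1.data = (s1 >= thresh).float()                    # binary mask used for training
```

The trace is only one convenient summary of the spectrum; the point of the sketch is the structure of the procedure (a mask-differentiable spectral surrogate, multi-sampled random weights, random inputs), not the exact saliency formula.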
Related papers
- Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning [14.792099973449794]
We propose an algorithm to align the training dynamics of the sparse network with those of the dense one.
We show how the usually neglected data-dependent component in the NTK's spectrum can be taken into account.
Path eXclusion (PX) is able to find lottery tickets even at high sparsity levels.
arXiv Detail & Related papers (2024-06-03T22:19:42Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation [2.33877878310217]
We derive an iterative linearised training method as a novel empirical tool to investigate why larger and deeper networks generalise well.
We show that in a variety of cases, iterative linearised training surprisingly performs on par with standard training.
We also show that feature learning is essential for good performance.
arXiv Detail & Related papers (2022-11-22T15:34:59Z) - Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for the NTK by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions needed to achieve comparable error bounds, both in theory and in practice. (A generic random-feature sketch for a one-hidden-layer ReLU NTK appears after this list.)
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - When and why PINNs fail to train: A neural tangent kernel perspective [2.1485350418225244]
We derive the Neural Tangent Kernel (NTK) of PINNs and prove that, under appropriate conditions, it converges to a deterministic kernel that stays constant during training in the infinite-width limit.
We find a remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error.
We propose a novel gradient descent algorithm that utilizes the eigenvalues of the NTK to adaptively calibrate the convergence rate of the total training error. (A toy sketch of this kind of NTK-based rebalancing appears after this list.)
arXiv Detail & Related papers (2020-07-28T23:44:56Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
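For the "Random Features for the Neural Tangent Kernel" entry above, the following NumPy sketch shows the textbook Monte Carlo feature map for the infinite-width NTK of a one-hidden-layer ReLU network (NTK parameterization, both layers trained, standard Gaussian initialization). It is a generic construction built from the arc-cosine kernels, not necessarily the feature map proposed in that paper, and the function names are hypothetical.

```python
# Generic random-feature sketch for the infinite-width NTK of a one-hidden-layer
# ReLU network. Illustrative; not the cited paper's specific construction.
import numpy as np

def relu_ntk_features(X, W):
    """Map inputs X (n, d) to features whose inner products approximate the NTK,
    using D random hidden weights W (D, d)."""
    D = W.shape[0]
    pre = X @ W.T                                   # pre-activations w_j . x, shape (n, D)
    act = np.maximum(pre, 0.0) / np.sqrt(D)         # output-layer term: E[relu * relu]
    gate = (pre > 0).astype(X.dtype) / np.sqrt(D)   # hidden-layer term: (x . x') E[1 * 1]
    grad_feats = (gate[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)
    return np.concatenate([act, grad_feats], axis=1)

def exact_relu_ntk(x, xp):
    """Closed-form infinite-width NTK of the same network (arc-cosine kernels)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    theta = np.arccos(cos)
    k1 = nx * nxp * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)  # order-1 term
    k0 = (np.pi - theta) / (2 * np.pi)                                      # order-0 term
    return k1 + (x @ xp) * k0

rng = np.random.default_rng(0)
d, D = 16, 4096
x, xp = rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((D, d))                     # w_j ~ N(0, I), matching the limit
phi = relu_ntk_features(np.stack([x, xp]), W)
print(phi[0] @ phi[1], exact_relu_ntk(x, xp))       # the two values should be close
```

As D grows, the Monte Carlo inner product converges to the closed-form kernel at the usual 1/sqrt(D) rate; the cited paper is about doing much better than this naive construction.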
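For the "When and why PINNs fail to train" entry above, here is a toy PyTorch sketch in the spirit of the described calibration, under the simplifying assumption that the trace of each loss component's NTK block (the sum of its eigenvalues) is used to rebalance the components. The "residual" term below is a stand-in for a real PDE residual, and the weighting rule is an illustrative assumption rather than that paper's exact algorithm.

```python
# Toy sketch (PyTorch): rebalance two loss components using the traces of their
# NTK blocks (sums of eigenvalues). Illustrative only; the "residual" errors are
# a stand-in for a real PDE residual, and the weighting rule is an assumption.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
params = list(net.parameters())

def block_trace(errors):
    """Trace of one NTK block: sum over points of ||d e_i / d theta||^2."""
    tr = torch.zeros(())
    for e in errors:
        g = torch.autograd.grad(e, params, retain_graph=True)
        tr = tr + sum((gi ** 2).sum() for gi in g)
    return tr

x_r = torch.rand(16, 1)                       # interior collocation points
x_b = torch.tensor([[0.0], [1.0]])            # boundary points
e_r = (net(x_r) - torch.sin(torch.pi * x_r)).squeeze(1)  # stand-in "residual" errors
e_b = net(x_b).squeeze(1)                     # boundary errors (target u = 0)

tr_r, tr_b = block_trace(e_r), block_trace(e_b)
lam_r = (tr_r + tr_b) / tr_r                  # smaller NTK trace -> larger weight
lam_b = (tr_r + tr_b) / tr_b
loss = lam_r * e_r.pow(2).mean() + lam_b * e_b.pow(2).mean()
loss.backward()                               # weights act as constants w.r.t. theta
print(float(lam_r), float(lam_b))
```

The intent mirrors the summary: loss components whose NTK block has small eigenvalue mass converge slowly under plain gradient descent, so they are up-weighted to equalize the effective convergence rates.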