Divergence of Empirical Neural Tangent Kernel in Classification Problems
- URL: http://arxiv.org/abs/2504.11130v1
- Date: Tue, 15 Apr 2025 12:30:21 GMT
- Title: Divergence of Empirical Neural Tangent Kernel in Classification Problems
- Authors: Zixiong Yu, Songtao Tian, Guhan Chen
- Abstract summary: In classification problems, fully connected neural networks (FCNs) and residual neural networks (ResNets) cannot be approximated by kernel logistic regression based on the Neural Tangent Kernel (NTK). We show that the empirical NTK does not uniformly converge to the NTK across all times on the training samples as the network width increases.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper demonstrates that in classification problems, fully connected neural networks (FCNs) and residual neural networks (ResNets) cannot be approximated by kernel logistic regression based on the Neural Tangent Kernel (NTK) under overtraining (i.e., when training time approaches infinity). Specifically, when using the cross-entropy loss, regardless of how large the network width is (as long as it is finite), the empirical NTK diverges from the NTK on the training samples as training time increases. To establish this result, we first demonstrate the strict positive definiteness of the NTKs for multi-layer FCNs and ResNets. Then, we prove that during training with the cross-entropy loss, the neural network parameters diverge if the smallest eigenvalue of the empirical NTK matrix (Gram matrix) with respect to the training samples is bounded below by a positive constant. This behavior contrasts sharply with the lazy training regime commonly observed in regression problems. Consequently, using a proof by contradiction, we show that the empirical NTK does not uniformly converge to the NTK across all times on the training samples as the network width increases. We validate our theoretical results through experiments on both synthetic data and the MNIST classification task. This finding implies that NTK theory is not applicable in this context, with significant theoretical implications for understanding neural networks in classification problems.
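As a rough, self-contained illustration of the quantities discussed in the abstract, the sketch below (not the authors' code; the two-layer architecture, width, learning rate, and synthetic data are illustrative assumptions) trains a small finite-width network with the logistic cross-entropy loss and tracks (i) the relative drift of the empirical NTK Gram matrix on the training samples from its value at initialization and (ii) its smallest eigenvalue.

```python
# Minimal sketch (illustrative assumptions, not the authors' implementation):
# monitor how the empirical NTK Gram matrix of a finite-width network drifts
# from its value at initialization when training with cross-entropy loss.
import torch

torch.manual_seed(0)
n, d, width = 8, 2, 256                        # samples, input dim, hidden width (all illustrative)
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()                      # synthetic binary labels

net = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
params = [p for p in net.parameters() if p.requires_grad]

def empirical_ntk(model, inputs):
    """Gram matrix K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    grads = []
    for i in range(inputs.shape[0]):
        out = model(inputs[i:i + 1]).squeeze()          # scalar network output f(x_i)
        g = torch.autograd.grad(out, params)            # per-sample parameter gradient
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(grads)                              # n x (number of parameters)
    return J @ J.T

K0 = empirical_ntk(net, X)                              # empirical NTK at initialization
opt = torch.optim.SGD(net.parameters(), lr=0.5)
loss_fn = torch.nn.BCEWithLogitsLoss()                  # logistic / cross-entropy loss

for step in range(5001):
    opt.zero_grad()
    loss = loss_fn(net(X).squeeze(-1), y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        Kt = empirical_ntk(net, X)
        drift = torch.linalg.matrix_norm(Kt - K0) / torch.linalg.matrix_norm(K0)
        lam_min = torch.linalg.eigvalsh(Kt).min()
        print(f"step {step:5d}  loss {loss.item():.4f}  "
              f"rel. NTK drift {drift.item():.3f}  min eig {lam_min.item():.3f}")
```

If the paper's result carries over to this toy setting, the relative drift should keep growing as training proceeds for any finite width, in contrast to the nearly constant kernel familiar from lazy-training analyses of regression with the squared loss.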
Related papers
- Issues with Neural Tangent Kernel Approach to Neural Networks [13.710104651002869]
We revisit the derivation of the NTK and conduct numerical experiments to evaluate this equivalence theorem. We observe that adding a layer to a neural network and the corresponding updated NTK do not yield matching changes in the predictor error. These observations suggest that the equivalence theorem does not hold well in practice and call into question whether neural tangent kernels adequately describe the training process of neural networks.
arXiv Detail & Related papers (2025-01-19T03:21:06Z) - Efficient kernel surrogates for neural network-based regression [0.8030359871216615]
We study the performance of the Conjugate Kernel (CK), an efficient approximation to the Neural Tangent Kernel (NTK).
We show that the CK performance is only marginally worse than that of the NTK and, in certain cases, even superior.
In addition to providing a theoretical grounding for using CKs instead of NTKs, our framework suggests a recipe for improving DNN accuracy inexpensively.
arXiv Detail & Related papers (2023-10-28T06:41:47Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a
Polynomial Net Study [55.12108376616355]
Prior work on the NTK has focused on typical neural network architectures and is incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z) - Limitations of the NTK for Understanding Generalization in Deep Learning [13.44676002603497]
We study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization.
We show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling.
arXiv Detail & Related papers (2022-06-20T21:23:28Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Rethinking Influence Functions of Neural Networks in the
Over-parameterized Regime [12.501827786462924]
The influence function (IF) is designed to measure the effect of removing a single training point on neural networks.
We use the neural tangent kernel (NTK) theory to calculate IF for the neural network trained with regularized mean-square loss.
arXiv Detail & Related papers (2021-12-15T17:44:00Z) - When and why PINNs fail to train: A neural tangent kernel perspective [2.1485350418225244]
We derive the Neural Tangent Kernel (NTK) of PINNs and prove that, under appropriate conditions, it converges to a deterministic kernel that stays constant during training in the infinite-width limit.
We find a remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error.
We propose a novel gradient descent algorithm that utilizes the eigenvalues of the NTK to adaptively calibrate the convergence rate of the total training error.
arXiv Detail & Related papers (2020-07-28T23:44:56Z) - Feature Purification: How Adversarial Training Performs Robust Deep
Learning [66.05472746340142]
We present a principle that we call Feature Purification: one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.