Spectral Bias Outside the Training Set for Deep Networks in the Kernel
Regime
- URL: http://arxiv.org/abs/2206.02927v1
- Date: Mon, 6 Jun 2022 22:09:15 GMT
- Title: Spectral Bias Outside the Training Set for Deep Networks in the Kernel
Regime
- Authors: Benjamin Bowman, Guido Montufar
- Abstract summary: We show that the network is biased to learn the top eigenfunctions of the Neural Kernel not just on the training set but over the entire input space.
This bias depends on the model architecture and input distribution alone.
We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide quantitative bounds measuring the $L^2$ difference in function
space between the trajectory of a finite-width network trained on finitely many
samples from the idealized kernel dynamics of infinite width and infinite data.
An implication of the bounds is that the network is biased to learn the top
eigenfunctions of the Neural Tangent Kernel not just on the training set but
over the entire input space. This bias depends on the model architecture and
input distribution alone and thus does not depend on the target function which
does not need to be in the RKHS of the kernel. The result is valid for deep
architectures with fully connected, convolutional, and residual layers.
Furthermore the width does not need to grow polynomially with the number of
samples in order to obtain high probability bounds up to a stopping time. The
proof exploits the low-effective-rank property of the Fisher Information Matrix
at initialization, which implies a low effective dimension of the model (far
smaller than the number of parameters). We conclude that local capacity control
from the low effective rank of the Fisher Information Matrix is still
underexplored theoretically.
Related papers
- Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond [0.0]
We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error.<n>We show that each coordinate of gradient-based feature vector admits a marginal normal approximation, thereby supporting the validity of FDR control.
arXiv Detail & Related papers (2025-12-04T11:46:06Z) - Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over- parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of fully-connected ReLU network.
We show that dimension of the resulting features is much smaller than other baseline feature map constructions to achieve comparable error bounds both in theory and practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for
Deep ReLU Networks [21.13299067136635]
We provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU networks.
In the finite-width setting, the network architectures we consider are quite general.
arXiv Detail & Related papers (2020-12-21T19:32:17Z) - Kernelized Classification in Deep Networks [49.47339560731506]
We propose a kernelized classification layer for deep networks.
We advocate a nonlinear classification layer by using the kernel trick on the softmax cross-entropy loss function during training.
We show the usefulness of the proposed nonlinear classification layer on several datasets and tasks.
arXiv Detail & Related papers (2020-12-08T21:43:19Z) - Learning Deep ReLU Networks Is Fixed-Parameter Tractable [21.625005195943707]
We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs.
We give an algorithm whose running time is a fixed weights in the ambient dimension.
Our bounds depend on the number of hidden units, depth, spectral norm of the spectral norm, and Lipschitz constant.
arXiv Detail & Related papers (2020-09-28T17:58:43Z) - How Powerful are Shallow Neural Networks with Bandlimited Random
Weights? [25.102870584507244]
We investigate the expressive power of limited depth-2 band random neural networks.
A random net is a neural network where the hidden layer parameters are frozen with random bandwidth.
arXiv Detail & Related papers (2020-08-19T13:26:12Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.