Sharper analysis of sparsely activated wide neural networks with
trainable biases
- URL: http://arxiv.org/abs/2301.00327v1
- Date: Sun, 1 Jan 2023 02:11:39 GMT
- Title: Sharper analysis of sparsely activated wide neural networks with
trainable biases
- Authors: Hongru Yang, Ziyu Jiang, Ruizhe Zhang, Zhangyang Wang, Yingbin Liang
- Abstract summary: This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime.
Surprisingly, the sparsified network is shown to converge as fast as the original network.
Since the generalization bound depends on the smallest eigenvalue of the limiting NTK, this work further studies that eigenvalue.
- Score: 103.85569570164404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies training one-hidden-layer overparameterized ReLU
networks via gradient descent in the neural tangent kernel (NTK) regime where,
unlike in previous works, the networks' biases are trainable and initialized to
some constant rather than zero. The first set of results characterizes the
convergence of the network's gradient descent dynamics. Surprisingly, the
sparsified network is shown to converge as fast as the original network. The
contribution over previous work is twofold: not only are the biases updated by
gradient descent in this setting, but a finer analysis also improves the width
required to keep the network close to its NTK. Secondly, a generalization bound
for the trained network is provided. A width-sparsity dependence is established
that yields a sparsity-dependent localized Rademacher complexity and a
generalization bound matching previous analyses (up to logarithmic factors). As
a by-product, if the bias initialization is chosen to be zero, the width
requirement improves on the previous bound for shallow networks'
generalization. Lastly, since the generalization bound depends on the smallest
eigenvalue of the limiting NTK and the eigenvalue bounds from previous works
yield a vacuous generalization bound, this work further studies the smallest
eigenvalue of the limiting NTK. Surprisingly, although trainable biases are not
shown to be necessary, they help identify a data-dependent region where a much
finer analysis of the NTK's smallest eigenvalue can be conducted, yielding a
much sharper lower bound than the previously known worst-case bound and,
consequently, a non-vacuous generalization bound.
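
To make the setting concrete, the following is a minimal NumPy sketch, not the
paper's construction, of a one-hidden-layer ReLU network with trainable biases
initialized to a constant B, a random neuron-level sparsification of it, and
the corresponding empirical NTK whose smallest eigenvalue the abstract refers
to. The width m, the constant B, and the keep probability are illustrative
choices, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)

d, m, n = 5, 1024, 20      # input dimension, hidden width, number of samples
B = 0.5                    # constant bias initialization (B = 0 gives the usual zero-bias setting)
keep_prob = 0.5            # fraction of hidden neurons kept after sparsification

X = rng.normal(size=(n, d)) / np.sqrt(d)           # toy inputs, roughly unit norm
W = rng.normal(size=(m, d))                        # first-layer weights ~ N(0, 1)
b = np.full(m, B)                                  # trainable biases, all initialized to B
a = rng.choice([-1.0, 1.0], size=m)                # fixed second-layer signs
mask = (rng.random(m) < keep_prob).astype(float)   # neuron-level sparsification mask

def forward(X):
    """f(x) = (1 / sqrt(m)) * sum_r a_r * mask_r * relu(w_r . x + b_r)."""
    pre = X @ W.T + b                              # (n, m) pre-activations
    return (np.maximum(pre, 0.0) * mask) @ a / np.sqrt(m)

def empirical_ntk(X):
    """Empirical NTK with respect to the trainable parameters (W, b)."""
    pre = X @ W.T + b
    act = (pre > 0).astype(float) * mask           # ReLU derivative, sparsified
    # grad_{w_r} f(x) = (a_r / sqrt(m)) * 1{w_r . x + b_r > 0} * mask_r * x and
    # grad_{b_r} f(x) = (a_r / sqrt(m)) * 1{w_r . x + b_r > 0} * mask_r, with a_r^2 = 1.
    H_weights = (act @ act.T) * (X @ X.T) / m      # contribution of the weights
    H_biases = (act @ act.T) / m                   # extra term from the trainable biases
    return H_weights + H_biases

H = empirical_ntk(X)
print("network outputs (first 3):", forward(X)[:3])
print("smallest eigenvalue of the empirical NTK:", np.linalg.eigvalsh(H).min())

In this sketch the trainable biases simply add the activation-pattern term
(act @ act.T) / m to the kernel; the paper's analysis of how a constant bias
initialization shapes the smallest eigenvalue of the limiting NTK is far more
delicate than this finite-width toy.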
Related papers
- Infinite Width Limits of Self Supervised Neural Networks [6.178817969919849]
We bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss.
We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity.
arXiv Detail & Related papers (2024-11-17T21:13:57Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a
Polynomial Net Study [55.12108376616355]
The study of NTK has been devoted to typical neural network architectures, but it is incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z) - Limitations of the NTK for Understanding Generalization in Deep Learning [13.44676002603497]
We study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization.
We show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling.
arXiv Detail & Related papers (2022-06-20T21:23:28Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, in time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers and with convolutional architectures.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)