Sharper analysis of sparsely activated wide neural networks with
trainable biases
- URL: http://arxiv.org/abs/2301.00327v1
- Date: Sun, 1 Jan 2023 02:11:39 GMT
- Title: Sharper analysis of sparsely activated wide neural networks with
trainable biases
- Authors: Hongru Yang, Ziyu Jiang, Ruizhe Zhang, Zhangyang Wang, Yingbin Liang
- Abstract summary: This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime.
Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network.
Since the generalization bound depends on the smallest eigenvalue of the limiting NTK, this work further studies that smallest eigenvalue.
- Score: 103.85569570164404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, unlike in previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve convergence as fast as the original network. The contribution over previous work is twofold: not only are the biases allowed to be updated by gradient descent in our setting, but a finer analysis is also given that improves the required width for ensuring the network's closeness to its NTK. Secondly, a generalization bound for the trained networks is provided. A width-sparsity dependence is presented, which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves on the previous bound for shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while trainable biases are not shown to be necessary, they help identify a nice data-dependent region in which a much finer analysis of the NTK's smallest eigenvalue can be conducted; this leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
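As a rough illustration of the setting described above (a minimal sketch under illustrative assumptions, not the paper's exact construction), the NumPy snippet below builds a one-hidden-layer ReLU network in the NTK parameterization with trainable biases initialized to a negative constant, so that only a fraction of the neurons are active on any given input. It reports the smallest eigenvalue of the empirical NTK Gram matrix at initialization and then runs full-batch gradient descent on the squared loss over both the weights and the biases. The width m, the bias constant B, the step size, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (assumptions for this sketch, not the paper's choices).
n, d, m = 50, 10, 4096        # samples, input dimension, hidden width
B = -1.0                      # constant negative bias init -> sparsely activated ReLUs
lr, steps = 2.0, 2000         # full-batch gradient-descent step size and iterations

# Synthetic unit-norm inputs and arbitrary targets.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

# NTK parameterization: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x + b_r),
# with fixed second-layer signs a_r and trainable (W, b).
W = rng.standard_normal((m, d))
b = np.full(m, B)                            # trainable biases, constant initialization
a = rng.choice([-1.0, 1.0], size=m)

def forward(X, W, b):
    pre = X @ W.T + b                        # (n, m) pre-activations
    mask = pre > 0                           # sparse activation pattern because B < 0
    return np.maximum(pre, 0.0) @ a / np.sqrt(m), mask

# Empirical NTK Gram matrix at initialization with respect to (W, b):
# H_ij = (1/m) * sum_r 1{i, r active} * 1{j, r active} * (x_i . x_j + 1).
_, mask0 = forward(X, W, b)
H0 = (mask0.astype(float) @ mask0.astype(float).T) * (X @ X.T + 1.0) / m
print("fraction of active neurons at init:", mask0.mean())
print("smallest eigenvalue of empirical NTK:", np.linalg.eigvalsh(H0).min())

# Full-batch gradient descent on the loss (1/(2n)) * sum_i (f(x_i) - y_i)^2.
for _ in range(steps):
    out, mask = forward(X, W, b)
    coeff = ((out - y)[:, None] * mask) * a / np.sqrt(m)   # (n, m)
    W -= lr * (coeff.T @ X) / n              # gradient with respect to W
    b -= lr * coeff.sum(axis=0) / n          # biases are trained as well

out, _ = forward(X, W, b)
print("training MSE after gradient descent:", np.mean((out - y) ** 2))
```

Because both the weights and the biases are trained, each jointly active neuron contributes x_i . x_j + 1 to a kernel entry, which is the `+ 1.0` term in the Gram matrix above; with the biases frozen, that term would disappear.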
Related papers
- Stabilizing RNN Gradients through Pre-training [3.335932527835653]
Learning theory suggests preventing the gradient from growing exponentially with depth or time in order to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient.
arXiv Detail & Related papers (2023-08-23T11:48:35Z)
- Principles for Initialization and Architecture Selection in Graph Neural Networks with ReLU Activations [17.51364577113718]
We present three principles for architecture selection in finite-width graph neural networks (GNNs) with ReLU activations.
First, we theoretically derive what is essentially the unique generalization of the well-known He initialization to ReLU GNNs.
Second, we prove that in finite-width vanilla ReLU GNNs, oversmoothing is unavoidable at large depth when using a fixed aggregation operator.
arXiv Detail & Related papers (2023-06-20T16:40:41Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence between the NTK of a fully-connected neural network and that of its randomly pruned version (a simplified numerical sketch of this comparison appears after this list).
arXiv Detail & Related papers (2022-03-27T15:22:19Z)
- On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
- A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in their depth, within a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
- A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel to show that networks can fit any finite training sample perfectly.
A simple and analytic kernel function is derived and shown to be equivalent to a fully-trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
arXiv Detail & Related papers (2020-07-02T05:07:55Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
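The NTK equivalence mentioned in the randomly pruned networks entry above can be illustrated with a small numerical comparison. The sketch below is a simplification rather than the cited paper's procedure: it prunes whole hidden neurons at random, whereas the cited work prunes individual weights, and it reuses the one-hidden-layer ReLU model and empirical-NTK formula from the earlier snippet. The keep probability, widths, and synthetic data are illustrative assumptions. As the width grows, the Gram matrices of the full and pruned networks should agree more closely, since both are averages of the same per-neuron kernel.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: unit-norm synthetic data, a constant negative bias,
# and pruning of whole hidden neurons (the cited paper prunes individual weights).
n, d, B, keep = 40, 10, -1.0, 0.5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

def empirical_ntk(X, W, b):
    """Empirical NTK Gram matrix of a one-hidden-layer ReLU network with
    trainable weights and biases, averaged over its hidden neurons."""
    width = W.shape[0]
    mask = (X @ W.T + b > 0).astype(float)   # (n, width) activation pattern
    return (mask @ mask.T) * (X @ X.T + 1.0) / width

for m in (1024, 8192):
    W = rng.standard_normal((m, d))
    b = np.full(m, B)
    H_full = empirical_ntk(X, W, b)

    survivors = rng.random(m) < keep         # keep roughly half of the neurons
    H_pruned = empirical_ntk(X, W[survivors], b[survivors])

    rel = np.linalg.norm(H_pruned - H_full) / np.linalg.norm(H_full)
    print(f"width {m:5d}: relative Frobenius gap between pruned and full NTK = {rel:.3f}")
```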
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.