Generalization Ability of Wide Residual Networks
- URL: http://arxiv.org/abs/2305.18506v1
- Date: Mon, 29 May 2023 15:01:13 GMT
- Title: Generalization Ability of Wide Residual Networks
- Authors: Jianfa Lai, Zixiong Yu, Songtao Tian, Qian Lin
- Abstract summary: We study the generalization ability of the wide residual network on $\mathbb{S}^{d-1}$ with the ReLU activation function.
We show that as the width $m\rightarrow\infty$, the residual network kernel (RNK) uniformly converges to the residual neural tangent kernel (RNTK).
- Score: 5.699259766376014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the generalization ability of the wide residual
network on $\mathbb{S}^{d-1}$ with the ReLU activation function. We first show
that as the width $m\rightarrow\infty$, the residual network kernel (RNK)
uniformly converges to the residual neural tangent kernel (RNTK). This uniform
convergence further guarantees that the generalization error of the residual
network converges to that of the kernel regression with respect to the RNTK. As
direct corollaries, we then show $i)$ the wide residual network with the early
stopping strategy can achieve the minimax rate provided that the target
regression function falls in the reproducing kernel Hilbert space (RKHS)
associated with the RNTK; $ii)$ the wide residual network cannot generalize
well if it is trained until it overfits the data. We finally present some
experiments to reconcile the apparent contradiction between our theoretical result and
the widely observed ``benign overfitting phenomenon''.
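As a rough, hedged illustration of the convergence claim (this is not the paper's construction; the depth, scaling factor $\alpha$, and parameterization below are illustrative assumptions), one can estimate the empirical residual network kernel $K_m(x,x')=\langle\nabla_\theta f_m(x),\nabla_\theta f_m(x')\rangle$ of a finite-width residual network at several widths and check that its entries stabilize as $m$ grows:
```python
# Hypothetical sketch: empirical residual network kernel (RNK) at several widths.
# The architecture (two residual blocks, alpha = 0.1, 1/sqrt(m) scaling) is an
# assumption made for illustration, not the parameterization used in the paper.
import jax
import jax.numpy as jnp

def init_params(key, d, m, depth=2):
    keys = jax.random.split(key, depth + 2)
    return {"A": jax.random.normal(keys[0], (m, d)),   # input layer
            "v": jax.random.normal(keys[1], (m,)),     # output layer
            "W": [jax.random.normal(k, (m, m)) for k in keys[2:]]}  # residual blocks

def resnet(params, x, alpha=0.1):
    m = params["v"].shape[0]
    h = params["A"] @ x                                # x lies on the unit sphere
    for W in params["W"]:
        h = h + alpha * W @ jax.nn.relu(h) / jnp.sqrt(m)
    return params["v"] @ h / jnp.sqrt(m)

def empirical_rnk(params, X):
    # K_ij = <grad_theta f(x_i), grad_theta f(x_j)>, from flattened parameter gradients
    grads = jax.vmap(lambda x: jax.grad(resnet)(params, x))(X)
    flat = jnp.concatenate([g.reshape(X.shape[0], -1)
                            for g in jax.tree_util.tree_leaves(grads)], axis=1)
    return flat @ flat.T

d, n = 5, 8
X = jax.random.normal(jax.random.PRNGKey(0), (n, d))
X = X / jnp.linalg.norm(X, axis=1, keepdims=True)      # project inputs onto S^{d-1}
for m in (64, 256, 1024):
    K = empirical_rnk(init_params(jax.random.PRNGKey(1), d, m), X)
    print(m, float(K[0, 1]))   # this entry should stabilize as the width m grows
```
If the RNK indeed converges uniformly to the RNTK, the printed entry should change less and less as the width doubles; the early-stopping and overfitting corollaries then concern kernel regression with that limiting kernel.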
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-squares regression problem with a two-layer fully-connected neural network with ReLU activation, trained by gradient flow.
Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise other than that they are bounded (a minimal training sketch for this setting appears after this list).
arXiv Detail & Related papers (2024-10-08T16:54:23Z) - Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNet in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z) - Improve Generalization Ability of Deep Wide Residual Network with A
Suitable Scaling Factor [0.0]
We show that if $\alpha$ is a constant, the class of functions induced by the Residual Neural Tangent Kernel (RNTK) is not learnable as the depth goes to infinity.
We also highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with increasing depth $L$, the degeneration phenomenon may still occur.
arXiv Detail & Related papers (2024-03-07T14:40:53Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, within training time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel to show that networks can fit any finite training sample perfectly.
A simple and analytic kernel function is derived that is indeed equivalent to a fully trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
arXiv Detail & Related papers (2020-07-02T05:07:55Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate (a toy averaged-SGD kernel-regression sketch appears after this list).
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
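The first related entry above (two-layer ReLU networks trained by gradient flow for least-squares regression) can be mimicked in a few lines of numpy. The sketch below is only a loose stand-in for that paper's setting: full-batch gradient descent serves as a discrete surrogate for gradient flow, and the target, noise, and hyperparameters are chosen purely for illustration.
```python
# Hypothetical sketch: two-layer ReLU network fit to a noisy regression task by
# full-batch gradient descent (a discrete stand-in for gradient flow).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr, steps = 100, 3, 512, 0.5, 3000

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # inputs on the sphere S^{d-1}
# Bounded target; Gaussian noise is used here for convenience, whereas the paper
# only assumes the noise is bounded.
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

W = rng.normal(size=(m, d))                          # hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m)                  # output weights, NTK-style 1/sqrt(m) scaling below

def forward(W, a):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

print("initial training MSE:", np.mean((forward(W, a) - y) ** 2))
for _ in range(steps):
    Z = X @ W.T                                      # pre-activations, shape (n, m)
    H = np.maximum(Z, 0.0)                           # ReLU features
    resid = H @ a / np.sqrt(m) - y                   # f(X) - y
    # Gradients of (1/2n) * sum(resid^2) with respect to both layers
    grad_a = H.T @ resid / (n * np.sqrt(m))
    grad_W = ((resid[:, None] * (Z > 0.0) * a).T @ X) / (n * np.sqrt(m))
    a -= lr * grad_a
    W -= lr * grad_W
print("final training MSE:  ", np.mean((forward(W, a) - y) ** 2))
```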
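For the entry "Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime", the toy sketch below runs stochastic functional-gradient steps for kernel regression and keeps a Polyak-Ruppert average of the iterates. An RBF kernel stands in for the ReLU NTK, so this illustrates the averaging scheme under assumed hyperparameters rather than that paper's analysis.
```python
# Hypothetical sketch: averaged SGD for kernel regression, with an RBF kernel
# used as a stand-in for the NTK of a ReLU network.
import numpy as np

rng = np.random.default_rng(0)
n, d, steps, lr = 200, 3, 5000, 0.5

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # inputs on the sphere S^{d-1}
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)   # noisy regression targets

def rbf_kernel(A, B, bw=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bw ** 2))

K = rbf_kernel(X, X)
beta = np.zeros(n)          # current iterate: f_t = sum_j beta_j k(x_j, .)
beta_bar = np.zeros(n)      # Polyak-Ruppert average of the iterates
for t in range(steps):
    i = rng.integers(n)                        # sample one training point
    resid = K[i] @ beta - y[i]                 # f_t(x_i) - y_i
    beta[i] -= lr * resid                      # stochastic functional-gradient step
    beta_bar += (beta - beta_bar) / (t + 1)    # running average

print("training MSE of averaged iterate:", np.mean((K @ beta_bar - y) ** 2))
```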