Improve Generalization Ability of Deep Wide Residual Network with A
Suitable Scaling Factor
- URL: http://arxiv.org/abs/2403.04545v1
- Date: Thu, 7 Mar 2024 14:40:53 GMT
- Title: Improve Generalization Ability of Deep Wide Residual Network with A
Suitable Scaling Factor
- Authors: Songtao Tian, Zixiong Yu
- Abstract summary: We show that if $\alpha$ is a constant, the class of functions induced by the Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable as the depth goes to infinity.
We also highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with increasing depth $L$, the degeneration phenomenon may still occur.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Residual Neural Networks (ResNets) have demonstrated remarkable success
across a wide range of real-world applications. In this paper, we identify a
suitable scaling factor (denoted by $\alpha$) on the residual branch of deep
wide ResNets to achieve good generalization ability. We show that if $\alpha$
is a constant, the class of functions induced by Residual Neural Tangent Kernel
(RNTK) is asymptotically not learnable, as the depth goes to infinity. We also
highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with
increasing depth $L$, the degeneration phenomenon may still occur. However,
when $\alpha$ decreases rapidly with $L$, the kernel regression with deep RNTK
with early stopping can achieve the minimax rate provided that the target
regression function falls in the reproducing kernel Hilbert space associated
with the infinite-depth RNTK. Our simulation studies on synthetic data and real
classification tasks such as MNIST, CIFAR10 and CIFAR100 support our
theoretical criteria for choosing $\alpha$.
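The paper's criterion concerns how the scaling factor $\alpha$ on the residual branch should shrink with the depth $L$. As a rough illustration only (not the authors' code), the sketch below wires such a factor into a fully connected residual network of the form $x_{l+1} = x_l + \alpha\, g_l(x_l)$; the module names and the particular decay schedule $\alpha = L^{-\gamma}$ are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class ScaledResidualMLP(nn.Module):
    """Fully connected ResNet with a scaling factor alpha on every residual
    branch: h <- h + alpha * block(h). The schedule alpha = depth**(-gamma)
    is an illustrative stand-in for "alpha decreases rapidly with L"."""

    def __init__(self, dim_in: int, width: int, depth: int, gamma: float = 1.0):
        super().__init__()
        self.embed = nn.Linear(dim_in, width)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width))
            for _ in range(depth)
        ])
        self.head = nn.Linear(width, 1)
        self.alpha = depth ** (-gamma)  # scaling factor on the residual branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        for block in self.blocks:
            h = h + self.alpha * block(h)  # scaled residual update
        return self.head(h)

# Example: depth L = 32 with alpha = L^{-1}
model = ScaledResidualMLP(dim_in=10, width=256, depth=32, gamma=1.0)
y = model(torch.randn(4, 10))
```

The exponent `gamma` above is only a knob for experimenting with how fast $\alpha$ decays; the paper's theoretical criteria, not this sketch, determine which decay is fast enough for the kernel regression with early stopping to attain the minimax rate.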
Related papers
- Deep ReLU networks -- injectivity capacity upper bounds [0.0]
We study deep ReLU feedforward neural networks (NNs) and their injectivity abilities.
For any given hidden-layer architecture, the injectivity capacity is defined as the minimal ratio between the number of a network's outputs and inputs.
Strong recent progress in precisely studying single-ReLU-layer injectivity properties is here extended to the deep network level.
arXiv Detail & Related papers (2024-12-27T14:57:40Z) - Generalization Ability of Wide Residual Networks [5.699259766376014]
We study the generalization ability of the wide residual network on $\mathbb{S}^{d-1}$ with the ReLU activation function.
We show that as the width $m \rightarrow \infty$, the residual network kernel uniformly converges to the residual neural tangent kernel (RNTK).
arXiv Detail & Related papers (2023-05-29T15:01:13Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Scalable Lipschitz Residual Networks with Convex Potential Flows [120.27516256281359]
We show that using convex potentials in a residual network gradient flow provides a built-in $1$-Lipschitz transformation.
A comprehensive set of experiments on CIFAR-10 demonstrates the scalability of our architecture and the benefit of our approach for $\ell_2$ provable defenses.
arXiv Detail & Related papers (2021-10-25T07:12:53Z) - A global convergence theory for deep ReLU implicit networks via
over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z) - Online Limited Memory Neural-Linear Bandits with Likelihood Matching [53.18698496031658]
We study neural-linear bandits for solving problems where both exploration and representation learning play an important role.
We propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.
arXiv Detail & Related papers (2021-02-07T14:19:07Z) - Towards an Understanding of Residual Networks Using Neural Tangent
Hierarchy (NTH) [2.50686294157537]
Gradient descent yields zero training loss in polynomial time for deep neural networks despite the non-convex nature of the objective function.
In this paper, we study the dynamics of the NTK for finite-width ResNet using the neural tangent hierarchy (NTH).
Our analysis suggests strongly that the particular skip-connection structure of ResNet is the main reason for its triumph.
arXiv Detail & Related papers (2020-07-07T18:08:16Z) - On Approximation Capabilities of ReLU Activation and Softmax Output
Layer in Neural Networks [6.852561400929072]
We prove that a sufficiently large neural network using the ReLU activation function can approximate any function in $L^1$ up to any arbitrary precision.
We also show that a large enough neural network using a nonlinear softmax output layer can also approximate any indicator function in $L^1$.
arXiv Detail & Related papers (2020-02-10T19:48:47Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)