ReLU soothes the NTK condition number and accelerates optimization for
wide neural networks
- URL: http://arxiv.org/abs/2305.08813v1
- Date: Mon, 15 May 2023 17:22:26 GMT
- Title: ReLU soothes the NTK condition number and accelerates optimization for
wide neural networks
- Authors: Chaoyue Liu, Like Hui
- Abstract summary: We show that ReLU leads to better separation for similar data and better conditioning of the neural tangent kernel (NTK), two closely related properties.
Our results imply that ReLU activation, as well as the depth of the ReLU network, helps improve the gradient descent convergence rate.
- Score: 9.374151703899047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rectified linear unit (ReLU), as a non-linear activation function, is well
known to improve the expressivity of neural networks such that any continuous
function can be approximated to arbitrary precision by a sufficiently wide
neural network. In this work, we present another interesting and important
feature of the ReLU activation function. We show that ReLU leads to: better
separation for similar data, and better conditioning of the neural tangent
kernel (NTK), which are closely related. Compared with linear neural networks,
we show that a ReLU-activated wide neural network at random initialization has
a larger angle separation for similar data in the feature space of the model
gradient, and has a smaller condition number for the NTK. Note that, for a
linear neural network, the data separation and the NTK condition number always
remain the same as in the case of a linear model. Furthermore, we show that a
deeper ReLU network (i.e., one with more ReLU activation operations) has a
smaller NTK condition number than a shallower one. Our results imply that ReLU
activation, as well as the depth of the ReLU network, helps improve the
gradient descent convergence rate, which is closely related to the NTK
condition number.
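The comparison above lends itself to a quick numerical check. Below is a minimal NumPy sketch (not the authors' code): it forms the empirical NTK of a one-hidden-layer network f(x) = (1/sqrt(m)) v . sigma(W x) at random initialization on nearly-aligned ("similar") inputs, and compares the NTK condition number under ReLU versus identity (linear) activation. The width, data dimensions, noise level, and seeds are illustrative assumptions.
```python
import numpy as np

def empirical_ntk(X, width=16384, use_relu=True, seed=0):
    """Empirical NTK of f(x) = (1/sqrt(m)) v . sigma(W x) at random initialization."""
    rng = np.random.default_rng(seed)
    m = width
    W = rng.normal(size=(m, X.shape[1]))     # hidden-layer weights, N(0, 1) entries
    v = rng.normal(size=m)                   # output-layer weights
    pre = X @ W.T                            # (n, m) pre-activations
    if use_relu:
        act, der = np.maximum(pre, 0.0), (pre > 0.0).astype(float)
    else:                                    # identity activation -> linear network
        act, der = pre, np.ones_like(pre)
    # NTK(x, x') = (1/m) [ (x . x') * sum_j v_j^2 sigma'(w_j.x) sigma'(w_j.x')
    #                      + sum_j sigma(w_j.x) sigma(w_j.x') ]
    K_hidden = (X @ X.T) * ((der * v**2) @ der.T)  # gradients w.r.t. hidden weights
    K_output = act @ act.T                         # gradients w.r.t. output weights
    return (K_hidden + K_output) / m

# "Similar" data: nearly-aligned unit vectors, i.e. small pairwise angles.
rng = np.random.default_rng(1)
d, n = 20, 10
base = rng.normal(size=d)
X = base + 0.1 * rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for name, flag in [("ReLU", True), ("linear", False)]:
    K = empirical_ntk(X, use_relu=flag)
    print(f"{name:6s} NTK condition number: {np.linalg.cond(K):.3e}")
```
The expected qualitative outcome is a noticeably smaller condition number for the ReLU network than for the linear one, consistent with the paper's claim; the exact values depend on the random draw and the chosen width.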
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Fixing the NTK: From Neural Network Linearizations to Exact Convex
Programs [63.768739279562105]
We show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data.
A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set.
arXiv Detail & Related papers (2023-09-26T17:42:52Z) - Using Linear Regression for Iteratively Training Neural Networks [4.873362301533824]
We present a simple linear regression based approach for learning the weights and biases of a neural network.
The approach is intended to extend to larger, more complex architectures.
arXiv Detail & Related papers (2023-07-11T11:53:25Z) - Nonparametric regression using over-parameterized shallow ReLU neural networks [10.339057554827392]
We show that neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes.
It is assumed that the regression function is from the Hölder space with smoothness $\alpha < (d+3)/2$ or a variation space corresponding to shallow neural networks.
As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks.
arXiv Detail & Related papers (2023-06-14T07:42:37Z) - Optimal rates of approximation by shallow ReLU$^k$ neural networks and
applications to nonparametric regression [12.21422686958087]
We study the approximation capacity of some variation spaces corresponding to shallow ReLU$^k$ neural networks.
For functions with less smoothness, the approximation rates in terms of the variation norm are established.
We show that shallow neural networks can achieve the minimax optimal rates for learning Hölder functions.
arXiv Detail & Related papers (2023-04-04T06:35:02Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Optimal Learning Rates of Deep Convolutional Neural Networks: Additive
Ridge Functions [19.762318115851617]
We consider the mean squared error analysis for deep convolutional neural networks.
We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal minimax rates.
arXiv Detail & Related papers (2022-02-24T14:22:32Z) - Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup (a minimal random-features baseline is sketched after this list).
arXiv Detail & Related papers (2021-06-15T04:44:52Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
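As a companion to the sketching-and-random-features entry above, here is a minimal NumPy sketch of the standard ReLU random-features approximation to the order-1 arc-cosine kernel, the kernel family that paper approximates. This is not that paper's sketching algorithm; it only shows the baseline Monte Carlo construction, and the feature count, data sizes, and seeds are illustrative assumptions.
```python
import numpy as np

def arccos1_kernel(X):
    """Closed-form order-1 arc-cosine kernel:
    k(x, y) = (1/pi) * ||x|| * ||y|| * (sin t + (pi - t) * cos t), t = angle(x, y)."""
    norms = np.linalg.norm(X, axis=1)
    cos_t = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    t = np.arccos(cos_t)
    return np.outer(norms, norms) * (np.sin(t) + (np.pi - t) * cos_t) / np.pi

def relu_random_features(X, num_features=8192, seed=0):
    """phi(x) = sqrt(2/m) * ReLU(W x) with rows of W ~ N(0, I);
    phi(x) . phi(y) approximates arccos1_kernel(x, y) as m grows."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_features, X.shape[1]))
    return np.sqrt(2.0 / num_features) * np.maximum(X @ W.T, 0.0)

X = np.random.default_rng(0).normal(size=(50, 16))
K_exact = arccos1_kernel(X)
Phi = relu_random_features(X)
K_approx = Phi @ Phi.T
rel_err = np.linalg.norm(K_approx - K_exact) / np.linalg.norm(K_exact)
print(f"relative Frobenius error of the random-feature approximation: {rel_err:.3f}")
```
The approximation error shrinks roughly like 1/sqrt(num_features); the paper's contribution is a faster, sketch-based route to approximating the full NTK/CNTK in near input-sparsity time, rather than this plain Monte Carlo construction.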
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.