Related papers: Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks

URL: http://arxiv.org/abs/2305.08813v2
Date: Mon, 20 Oct 2025 20:34:45 GMT
Title: Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks
Authors: Chaoyue Liu, Han Bi, Like Hui, Xiao Liu
Abstract summary: We show that ReLU activation helps to improve the worst-case convergence rates of gradient based methods. Due to the close connection between NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient based methods.
Score: 6.399229363353879
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread implementation. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing enabling and disabling the nonlinear activations in the neural network, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of neural tangent kernel (NTK). Furthermore, we show that the network depth (i.e., with more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated with a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and NTK condition number is equivalent to the Gram matrix, regardless of the network depth. Due to the close connection between NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient based methods.
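
As a rough numerical illustration of claim (b) in the abstract, the sketch below compares the condition number of the input Gram matrix (which, per the abstract, is what a linear network yields) with that of an infinite-width ReLU NTK on two nearly parallel inputs. It assumes the standard closed-form NTK of a two-layer, bias-free ReLU network with Gaussian initialization on unit-norm inputs (built from arc-cosine kernels); the paper's exact parameterization may differ. The names `gram`, `relu_ntk`, and `two_points` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def _normalize(X):
    # Project every input onto the unit sphere (assumption of this sketch).
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def gram(X):
    # Gram matrix of the normalized inputs; stands in for the linear-network case.
    Xn = _normalize(X)
    return Xn @ Xn.T


def relu_ntk(X):
    # Assumed closed form for a two-layer ReLU network in the infinite-width limit:
    #   K(x, z) = <x, z> * kappa0(theta) + kappa1(theta),  theta = angle(x, z),
    # where kappa0 and kappa1 are the order-0 and order-1 arc-cosine kernels.
    G = np.clip(gram(X), -1.0, 1.0)  # clip guards arccos against round-off
    theta = np.arccos(G)
    kappa0 = (np.pi - theta) / (2.0 * np.pi)
    kappa1 = (np.sin(theta) + (np.pi - theta) * G) / (2.0 * np.pi)
    return G * kappa0 + kappa1


def two_points(theta):
    # Two unit vectors in the plane separated by angle theta (radians).
    return np.array([[1.0, 0.0], [np.cos(theta), np.sin(theta)]])


if __name__ == "__main__":
    X = two_points(0.1)  # two very similar inputs
    print("Gram condition number:    ", np.linalg.cond(gram(X)))
    print("ReLU NTK condition number:", np.linalg.cond(relu_ntk(X)))
```

For two inputs at a small angle θ, the Gram condition number is (1 + cos θ)/(1 − cos θ), which grows like 4/θ², whereas under this assumed kernel the off-diagonal entry is roughly 1 − θ/(2π), so the condition number grows only like 4π/θ. This is a two-point toy case, not a reproduction of the paper's results.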

Related papers

Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data [52.737775129027575]
We show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits Neural Collapse (NC). We reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.
arXiv Detail & Related papers (2025-10-24T01:36:19Z)
A Novel Explanation Against Linear Neural Networks [1.223779595809275]
Linear Regression and neural networks are widely used to model data. We show that neural networks without activation functions, or linear neural networks, actually reduce both training and testing performance. We prove this hypothesis through an analysis of the optimization of an LNN and rigorous testing comparing the performance between both LNNs and linear regression on noisy datasets.
arXiv Detail & Related papers (2023-12-30T09:44:51Z)
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs [63.768739279562105]
We show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data. A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set.
arXiv Detail & Related papers (2023-09-26T17:42:52Z)
Using Linear Regression for Iteratively Training Neural Networks [4.873362301533824]
We present a simple linear regression based approach for learning the weights and biases of a neural network. The approach is intended to be extended to larger, more complex architectures.
arXiv Detail & Related papers (2023-07-11T11:53:25Z)
Nonparametric regression using over-parameterized shallow ReLU neural networks [10.339057554827392]
We show that neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes. It is assumed that the regression function is from the Hölder space with smoothness $\alpha < (d+3)/2$ or a variation space corresponding to shallow neural networks. As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks.
arXiv Detail & Related papers (2023-06-14T07:42:37Z)
Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification. Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
Optimal rates of approximation by shallow ReLU$^k$ neural networks and applications to nonparametric regression [12.21422686958087]
We study the approximation capacity of some variation spaces corresponding to shallow ReLU$^k$ neural networks. For functions with less smoothness, the approximation rates in terms of the variation norm are established. We show that shallow neural networks can achieve the minimax optimal rates for learning Hölder functions.
arXiv Detail & Related papers (2023-04-04T06:35:02Z)
Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations. We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime. We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK. We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
Linearity Grafting: Relaxed Neuron Pruning Helps Certifiable Robustness [172.61581010141978]
Certifiable robustness is a desirable property for adopting deep neural networks (DNNs) in safety-critical scenarios. We propose a novel solution to strategically manipulate neurons, by "grafting" appropriate levels of linearity.
arXiv Detail & Related papers (2022-06-15T22:42:29Z)
Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs). Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linearly separable feature space. This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
Optimal Learning Rates of Deep Convolutional Neural Networks: Additive Ridge Functions [19.762318115851617]
We consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates.
arXiv Detail & Related papers (2022-02-24T14:22:32Z)
Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets. We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels. We show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on CIFAR-10 dataset while achieving 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
On the linearity of large non-linear models: when and why the tangent kernel is constant [20.44438519046223]
We shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network.
arXiv Detail & Related papers (2020-10-02T16:44:45Z)
Measuring Model Complexity of Neural Networks with Curve Activation Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function. We experimentally explore the training process of neural networks and detect overfitting. We find that the $L1$ and $L2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
Deep Neural Networks with Trainable Activations and Controlled Lipschitz Constant [26.22495169129119]
We introduce a variational framework to learn the activation functions of deep neural networks. Our aim is to increase the capacity of the network while controlling an upper-bound of the Lipschitz constant. We numerically compare our scheme with standard ReLU network and its variations, PReLU and LeakyReLU.
arXiv Detail & Related papers (2020-01-17T12:32:55Z)
LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks [14.943863837083496]
We propose a Linearly Scaled Hyperbolic Tangent (LiSHT) for Neural Networks (NNs) by scaling the Tanh linearly. We observe superior performance using Multi-layer Perceptron (MLP), Residual Network (ResNet) and Long Short-Term Memory (LSTM) for data classification, image classification and tweets classification tasks.
arXiv Detail & Related papers (2019-01-01T02:24:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.