Deep Neural Network Initialization with Sparsity Inducing Activations
- URL: http://arxiv.org/abs/2402.16184v1
- Date: Sun, 25 Feb 2024 20:11:40 GMT
- Title: Deep Neural Network Initialization with Sparsity Inducing Activations
- Authors: Ilan Price, Nicholas Daultry Ball, Samuel C.H. Lam, Adam C. Jones,
Jared Tanner
- Abstract summary: We use the large width Gaussian process limit to analyze the behaviour of nonlinear activations that induce sparsity in the hidden outputs.
A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification.
We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map.
- Score: 5.437298646956505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inducing and leveraging sparse activations during training and inference is a
promising avenue for improving the computational efficiency of deep networks,
which is increasingly important as network sizes continue to grow and their
application becomes more widespread. Here we use the large width Gaussian
process limit to analyze the behaviour, at random initialization, of nonlinear
activations that induce sparsity in the hidden outputs. A previously unreported
form of training instability is proven for arguably two of the most natural
candidates for hidden layer sparsification; those being a shifted ReLU
($\phi(x)=\max(0, x-\tau)$ for $\tau\ge 0$) and soft thresholding ($\phi(x)=0$
for $|x|\le\tau$ and $x-\text{sign}(x)\tau$ for $|x|>\tau$). We show that this
instability is overcome by clipping the nonlinear activation magnitude, at a
level prescribed by the shape of the associated Gaussian process variance map.
Numerical experiments verify the theory and show that the proposed magnitude
clipped sparsifying activations can be trained with training and test
fractional sparsity as high as 85\% while retaining close to full accuracy.
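As a concrete illustration of the activations named in the abstract, here is a minimal pure-Python sketch of the shifted ReLU, soft thresholding, their magnitude-clipped variant, and a Monte Carlo estimate of the Gaussian process variance map $V(q)=\mathbb{E}[\phi(\sqrt{q}\,Z)^2]$, $Z\sim\mathcal{N}(0,1)$. The threshold `tau = 0.5`, the clipping level, the sample size, and the perturbed starting variance are illustrative assumptions, not values prescribed by the paper:

```python
import math
import random

def shifted_relu(x, tau):
    # Shifted ReLU: phi(x) = max(0, x - tau), with tau >= 0.
    return max(0.0, x - tau)

def soft_threshold(x, tau):
    # Soft thresholding: phi(x) = 0 for |x| <= tau, else x - sign(x)*tau.
    if abs(x) <= tau:
        return 0.0
    return x - math.copysign(tau, x)

def clipped(phi, x, tau, m):
    # Magnitude clipping: restrict |phi(x)| to at most m (assumes m > 0).
    return max(-m, min(m, phi(x, tau)))

def variance_map(phi, q, tau, n=100_000, seed=0):
    # Monte Carlo estimate of the variance map V(q) = E[phi(sqrt(q) Z)^2]
    # for Z ~ N(0, 1); n and seed are illustrative choices.
    rng = random.Random(seed)
    s = math.sqrt(q)
    return sum(phi(s * rng.gauss(0.0, 1.0), tau) ** 2 for _ in range(n)) / n

# Normalize the weight variance ("gain") so that q = 1 is a fixed point of
# the layer recursion q_{l+1} = gain * V(q_l), then iterate from a slightly
# perturbed pre-activation variance to probe the stability of that point.
tau = 0.5
gain = 1.0 / variance_map(soft_threshold, 1.0, tau)
q = 1.05
for _ in range(10):
    q = gain * variance_map(soft_threshold, q, tau)
```

Per the paper's result, for the unclipped activations this fixed point is unstable, so the depth-wise recursion drifts away from the normalized variance; clipping at a level prescribed by the shape of the variance map restores stability. This sketch only mirrors that setup numerically and is not the authors' implementation.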
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z)
- Efficient uniform approximation using Random Vector Functional Link networks [0.0]
A Random Vector Functional Link (RVFL) network is a depth-2 neural network with random inner nodes and biases.
We show that an RVFL with ReLU activation can approximate Lipschitz continuous target functions.
Our method of proof is rooted in probability theory and harmonic analysis.
arXiv Detail & Related papers (2023-06-30T09:25:03Z)
- Wide neural networks: From non-gaussian random fields at initialization to the NTK geometry of training [0.0]
Recent developments in applications of artificial neural networks with over $n=10^{14}$ parameters make it extremely important to study the large $n$ behaviour of such networks.
Most works studying wide neural networks have focused on the infinite width $n \to +\infty$ limit of such networks.
In this work we will study their behavior for large, but finite $n$.
arXiv Detail & Related papers (2023-04-06T21:34:13Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds [99.23098204458336]
Certified robustness is a desirable property for deep neural networks in safety-critical applications.
We show that our method consistently outperforms state-of-the-art methods on the MNIST and TinyImageNet datasets.
arXiv Detail & Related papers (2021-11-02T06:44:10Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- On Lipschitz Regularization of Convolutional Layers using Toeplitz Matrix Theory [77.18089185140767]
Lipschitz regularity is established as a key property of modern deep learning.
However, computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard.
We introduce a new upper bound for convolutional layers that is both tight and easy to compute.
arXiv Detail & Related papers (2020-06-15T13:23:34Z)
- Fractional moment-preserving initialization schemes for training deep neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of the pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z)
- Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width instead of exponential, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.