Deep Neural Network Initialization with Sparsity Inducing Activations
- URL: http://arxiv.org/abs/2402.16184v1
- Date: Sun, 25 Feb 2024 20:11:40 GMT
- Title: Deep Neural Network Initialization with Sparsity Inducing Activations
- Authors: Ilan Price, Nicholas Daultry Ball, Samuel C.H. Lam, Adam C. Jones,
Jared Tanner
- Abstract summary: We use the large width Gaussian process limit to analyze the behaviour of nonlinear activations that induce sparsity in the hidden outputs.
A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification.
We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map.
- Score: 5.437298646956505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inducing and leveraging sparse activations during training and inference is a
promising avenue for improving the computational efficiency of deep networks,
which is increasingly important as network sizes continue to grow and their
application becomes more widespread. Here we use the large width Gaussian
process limit to analyze the behaviour, at random initialization, of nonlinear
activations that induce sparsity in the hidden outputs. A previously unreported
form of training instability is proven for arguably two of the most natural
candidates for hidden layer sparsification; those being a shifted ReLU
($\phi(x)=\max(0, x-\tau)$ for $\tau\ge 0$) and soft thresholding ($\phi(x)=0$
for $|x|\le\tau$ and $x-\text{sign}(x)\tau$ for $|x|>\tau$). We show that this
instability is overcome by clipping the nonlinear activation magnitude, at a
level prescribed by the shape of the associated Gaussian process variance map.
Numerical experiments verify the theory and show that the proposed magnitude
clipped sparsifying activations can be trained with training and test
fractional sparsity as high as 85\% while retaining close to full accuracy.
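As a concrete illustration of the activations named in the abstract, here is a minimal pure-Python sketch of the shifted ReLU, soft thresholding, their magnitude-clipped variant, and a Monte Carlo estimate of the Gaussian process variance map $V(q)=\mathbb{E}[\phi(\sqrt{q}\,Z)^2]$, $Z\sim\mathcal{N}(0,1)$. The threshold `tau = 0.5`, the clipping level, the sample size, and the perturbed starting variance are illustrative assumptions, not values prescribed by the paper:

```python
import math
import random

def shifted_relu(x, tau):
    # Shifted ReLU: phi(x) = max(0, x - tau), with tau >= 0.
    return max(0.0, x - tau)

def soft_threshold(x, tau):
    # Soft thresholding: phi(x) = 0 for |x| <= tau, else x - sign(x)*tau.
    if abs(x) <= tau:
        return 0.0
    return x - math.copysign(tau, x)

def clipped(phi, x, tau, m):
    # Magnitude clipping: restrict |phi(x)| to at most m (assumes m > 0).
    return max(-m, min(m, phi(x, tau)))

def variance_map(phi, q, tau, n=100_000, seed=0):
    # Monte Carlo estimate of the variance map V(q) = E[phi(sqrt(q) Z)^2]
    # for Z ~ N(0, 1); n and seed are illustrative choices.
    rng = random.Random(seed)
    s = math.sqrt(q)
    return sum(phi(s * rng.gauss(0.0, 1.0), tau) ** 2 for _ in range(n)) / n

# Normalize the weight variance ("gain") so that q = 1 is a fixed point of
# the layer recursion q_{l+1} = gain * V(q_l), then iterate from a slightly
# perturbed pre-activation variance to probe the stability of that point.
tau = 0.5
gain = 1.0 / variance_map(soft_threshold, 1.0, tau)
q = 1.05
for _ in range(10):
    q = gain * variance_map(soft_threshold, q, tau)
```

Per the paper's result, for the unclipped activations this fixed point is unstable, so the depth-wise recursion drifts away from the normalized variance; clipping at a level prescribed by the shape of the variance map restores stability. This sketch only mirrors that setup numerically and is not the authors' implementation.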
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z)
- Efficient uniform approximation using Random Vector Functional Link networks [0.0]
A Random Vector Functional Link (RVFL) network is a depth-2 neural network with random inner nodes and biases.
We show that an RVFL with ReLU activation can approximate Lipschitz continuous target functions.
Our method of proof is rooted in probability theory and harmonic analysis.
arXiv Detail & Related papers (2023-06-30T09:25:03Z)
- Wide neural networks: From non-gaussian random fields at initialization to the NTK geometry of training [0.0]
Recent developments in applications of artificial neural networks with over $n=10^{14}$ parameters make it extremely important to study the large $n$ behaviour of such networks.
Most works studying wide neural networks have focused on the infinite width $n \to +\infty$ limit of such networks.
In this work we will study their behavior for large, but finite $n$.
arXiv Detail & Related papers (2023-04-06T21:34:13Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds [99.23098204458336]
Certified robustness is a desirable property for deep neural networks in safety-critical applications.
We show that our method consistently outperforms state-of-the-art methods on the MNIST and TinyImageNet datasets.
arXiv Detail & Related papers (2021-11-02T06:44:10Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- On Lipschitz Regularization of Convolutional Layers using Toeplitz Matrix Theory [77.18089185140767]
Lipschitz regularity is established as a key property of modern deep learning.
However, computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard.
We introduce a new upper bound for convolutional layers that is both tight and easy to compute.
arXiv Detail & Related papers (2020-06-15T13:23:34Z)
- Fractional moment-preserving initialization schemes for training deep neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of the pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z)
- Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width instead of exponential, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.