Activation function design for deep networks: linearity and effective
initialisation
- URL: http://arxiv.org/abs/2105.07741v1
- Date: Mon, 17 May 2021 11:30:46 GMT
- Title: Activation function design for deep networks: linearity and effective
initialisation
- Authors: Michael Murray, Vinayak Abrol, Jared Tanner
- Abstract summary: We study how to avoid two problems at initialisation identified in prior works.
We prove that both these problems can be avoided by choosing an activation function possessing a sufficiently large linear region around the origin.
- Score: 10.108857371774977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The activation function deployed in a deep neural network has great influence
on the performance of the network at initialisation, which in turn has
implications for training. In this paper we study how to avoid two problems at
initialisation identified in prior works: rapid convergence of pairwise input
correlations, and vanishing and exploding gradients. We prove that both these
problems can be avoided by choosing an activation function possessing a
sufficiently large linear region around the origin, relative to the bias
variance $\sigma_b^2$ of the network's random initialisation. We demonstrate
empirically that using such activation functions leads to tangible benefits in
practice, both in terms of test and training accuracy as well as training time.
Furthermore, we observe that the shape of the nonlinear activation outside the
linear region appears to have a relatively limited impact on training. Finally,
our results also allow us to train networks in a new hyperparameter regime,
with a much larger bias variance than has previously been possible.
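As an illustration of the abstract's main criterion, the following is a minimal sketch (not the authors' construction; the activation `linearized_tanh`, the width parameter `a`, and the numerical values are assumptions for illustration) of an activation that is exactly the identity on a region [-a, a] around the origin, together with a toy check of how many Gaussian pre-activations with standard deviation driven by sigma_b land inside that region.

```python
import numpy as np

def linearized_tanh(x, a=1.0):
    """Hypothetical activation with an exact linear region on [-a, a].

    Inside [-a, a] it is the identity; outside it saturates with tanh-like
    tails.  An illustrative stand-in for the class of activations the paper
    studies, not the authors' specific construction.
    """
    x = np.asarray(x, dtype=float)
    return np.where(
        np.abs(x) <= a,
        x,                                          # linear region around the origin
        np.sign(x) * (a + np.tanh(np.abs(x) - a)),  # saturating tails, continuous at |x| = a
    )

# Toy check of the scale argument: if pre-activations at initialisation are
# roughly Gaussian with a spread set in part by sigma_b, then making the
# linear region a few standard deviations wide keeps most units in the
# linear regime (illustrative numbers only).
sigma_b = 0.5
a = 2.0 * sigma_b
z = np.random.normal(0.0, sigma_b, size=100_000)
print("fraction of pre-activations inside the linear region:",
      np.mean(np.abs(z) <= a))
```

With these toy numbers roughly 95% of units behave linearly at initialisation, which is the kind of regime the abstract argues avoids the correlation and gradient pathologies.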
Related papers
- Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training [1.7205106391379021]
A neural network with ReLU activations may be viewed as a composition of piecewise linear functions.
We introduce a novel training strategy that forces the network to exhibit a number of linear regions exponential in depth (a toy region-counting sketch appears after this list).
This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts.
arXiv Detail & Related papers (2023-11-29T19:09:48Z)
- ENN: A Neural Network with DCT Adaptive Activation Functions [2.2713084727838115]
We present Expressive Neural Network (ENN), a novel model in which the non-linear activation functions are modeled using the Discrete Cosine Transform (DCT)
This parametrization keeps the number of trainable parameters low, is appropriate for gradient-based schemes, and adapts to different learning tasks.
ENN outperforms state-of-the-art benchmarks, with an accuracy gap of over 40% in some scenarios.
arXiv Detail & Related papers (2023-07-02T21:46:30Z)
- Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think [0.0]
We investigate how much we can simplify the network function towards linearity before performance collapses.
We find that after training, we are able to linearize a significant number of nonlinear units while maintaining a high performance.
Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width.
arXiv Detail & Related papers (2022-11-30T17:24:14Z)
- Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic [137.04558017227583]
Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years.
We take a mean-field perspective on the evolution and convergence of feature-based neural AC.
We prove that neural AC finds the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2021-12-27T06:09:50Z)
- Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z)
- On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
- Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We present a principle we call Feature Purification: one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
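The first related paper above treats a ReLU network as a composition of piecewise linear functions whose number of linear regions can grow exponentially with depth. The sketch below is an assumption-laden illustration rather than that paper's training strategy (the layer sizes, the grid, and the helper `count_linear_regions_1d` are all hypothetical): it approximates the number of linear regions of a scalar-input ReLU network by tracking where the hidden-unit activation pattern changes along a fine grid.

```python
import numpy as np

def count_linear_regions_1d(weights, biases, xs):
    """Approximately count linear regions of a scalar-input ReLU network.

    A ReLU network is piecewise linear, and the on/off pattern of its hidden
    units is constant on each linear region, so counting distinct consecutive
    activation patterns along a fine 1-D grid approximates the region count.
    """
    patterns = []
    for x in xs:
        h = np.array([x], dtype=float)
        pattern = []
        for W, b in zip(weights, biases):
            pre = W @ h + b
            pattern.append(tuple(pre > 0))   # which units of this layer are active
            h = np.maximum(pre, 0.0)         # ReLU
        patterns.append(tuple(pattern))
    changes = sum(p1 != p2 for p1, p2 in zip(patterns, patterns[1:]))
    return changes + 1

# Toy two-hidden-layer network on scalar inputs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 1)), rng.normal(size=(8, 8))]
biases = [rng.normal(size=8), rng.normal(size=8)]
xs = np.linspace(-3.0, 3.0, 5001)
print("approximate number of linear regions:",
      count_linear_regions_1d(weights, biases, xs))
```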