Criticality versus uniformity in deep neural networks
- URL: http://arxiv.org/abs/2304.04784v1
- Date: Mon, 10 Apr 2023 18:00:00 GMT
- Title: Criticality versus uniformity in deep neural networks
- Authors: Aleksandar Bukva, Jurriaan de Gier, Kevin T. Grosvenor, Ro Jefferson,
Koenraad Schalm, Eliot Schwander
- Abstract summary: Deep feedforward networks initialized along the edge of chaos exhibit exponentially superior training ability as quantified by maximum trainable depth.
In particular, we determine the line of uniformity in phase space along which the post-activation distribution has maximum entropy.
- Score: 52.77024349608834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep feedforward networks initialized along the edge of chaos exhibit
exponentially superior training ability as quantified by maximum trainable
depth. In this work, we explore the effect of saturation of the tanh activation
function along the edge of chaos. In particular, we determine the line of
uniformity in phase space along which the post-activation distribution has
maximum entropy. This line intersects the edge of chaos, and indicates the
regime beyond which saturation of the activation function begins to impede
training efficiency. Our results suggest that initialization along the edge of
chaos is a necessary but not sufficient condition for optimal trainability.
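For intuition, the sketch below is a minimal numerical illustration, not the authors' code: it estimates the entropy of the post-activation distribution of a deep tanh network at a given point (sigma_w, sigma_b) of phase space, assuming the standard mean-field parameterization W_ij ~ N(0, sigma_w^2/N) and b_i ~ N(0, sigma_b^2); the function name and default sizes are illustrative choices. Scanning such an estimate over initializations and locating its maximum is one way to visualize the line of uniformity described above.

```python
# Hypothetical sketch: entropy of the post-activation distribution of a deep
# tanh MLP under a mean-field initialization (assumed setup, not the paper's code).
import numpy as np

def post_activation_entropy(sigma_w, sigma_b, width=500, depth=30,
                            n_samples=20, bins=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, width))  # Gaussian inputs
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        b = rng.normal(0.0, sigma_b, size=width)
        x = np.tanh(x @ W + b)  # post-activations lie in (-1, 1)
    counts, edges = np.histogram(x.ravel(), bins=bins, range=(-1.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    bin_width = edges[1] - edges[0]
    # Histogram estimate of differential entropy; the uniform distribution
    # on (-1, 1) saturates this at log(2) ~ 0.69.
    return -np.sum(p * np.log(p)) + np.log(bin_width)

# Example scan at fixed sigma_b: the estimate is expected to peak at
# intermediate sigma_w and drop once tanh saturation pushes mass toward +/-1.
for sw in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(f"sigma_w = {sw:.1f}: entropy ~ {post_activation_entropy(sw, 0.05):.3f}")
```

At small sigma_w the activations collapse toward zero, while at large sigma_w they saturate near +/-1; in both regimes the entropy estimate falls, which is consistent with a maximum-entropy line lying between the two.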
Related papers
- Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment [0.0]
We use an exponential solver to train a neural network without entering the edge of stability.
We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned.
arXiv Detail & Related papers (2024-05-31T18:37:06Z)
- Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [54.20763128054692]
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression.
We prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics.
arXiv Detail & Related papers (2024-02-29T18:43:52Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks [7.090165638014331]
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function.
We show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity.
arXiv Detail & Related papers (2022-04-14T22:50:28Z)
- The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
- Activation function design for deep networks: linearity and effective initialisation [10.108857371774977]
We study how to avoid two problems at initialisation identified in prior works.
We prove that both these problems can be avoided by choosing an activation function possessing a sufficiently large linear region around the origin.
arXiv Detail & Related papers (2021-05-17T11:30:46Z)
- Eccentric Regularization: Minimizing Hyperspherical Energy without explicit projection [0.913755431537592]
We introduce a novel regularizing loss function which simulates a pairwise repulsive force between items.
We show that minimizing this loss function in isolation achieves a hyperspherical distribution.
We apply this method of Eccentric Regularization to an autoencoder, and demonstrate its effectiveness in image generation, representation learning and downstream classification tasks.
arXiv Detail & Related papers (2021-04-23T13:55:17Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
- Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality [74.0084803220897]
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations.
We show convergence to low robust training loss for polynomial width instead of exponential, under natural assumptions and with the ReLU activation.
arXiv Detail & Related papers (2020-02-16T20:13:43Z)