Effect of the initial configuration of weights on the training and
function of artificial neural networks
- URL: http://arxiv.org/abs/2012.02550v1
- Date: Fri, 4 Dec 2020 12:13:12 GMT
- Title: Effect of the initial configuration of weights on the training and
function of artificial neural networks
- Authors: R. J. Jesus, M. L. Antunes, R. A. da Costa, S. N. Dorogovtsev, J. F.
F. Mendes, R. L. Aguiar
- Abstract summary: We perform the quantitative statistical characterization of the deviation of the weights of two-hidden-layer ReLU networks of various sizes trained via Stochastic Gradient Descent (SGD).
We observed that successful training via SGD leaves the network in the close neighborhood of the initial configuration of its weights.
Our results suggest that SGD's ability to efficiently find local minima is restricted to the vicinity of the random initial configuration of weights.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The function and performance of neural networks are largely determined by the
evolution of their weights and biases in the process of training, starting from
the initial configuration of these parameters to one of the local minima of the
loss function. We perform the quantitative statistical characterization of the
deviation of the weights of two-hidden-layer ReLU networks of various sizes
trained via Stochastic Gradient Descent (SGD) from their initial random
configuration. We compare the evolution of the distribution function of this
deviation with the evolution of the loss during training. We observed that
successful training via SGD leaves the network in the close neighborhood of the
initial configuration of its weights. For each initial weight of a link we
measured the distribution function of the deviation from this value after
training and found how the moments of this distribution and its peak depend on
the initial weight. We explored the evolution of these deviations during
training and observed an abrupt increase within the overfitting region. This
jump occurs simultaneously with a similarly abrupt increase recorded in the
evolution of the loss function. Our results suggest that SGD's ability to
efficiently find local minima is restricted to the vicinity of the random
initial configuration of weights.
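The measurement described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' code: a small two-hidden-layer ReLU network is trained with plain SGD on a toy regression task, and the relative deviation of each layer's weights from their random initial configuration is computed afterward. The widths, learning rate, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) on [-2, 2]
X = rng.uniform(-2, 2, size=(256, 1))
y = np.sin(X)

# Two hidden ReLU layers; widths are illustrative assumptions
sizes = [1, 32, 32, 1]
W = [rng.normal(0, np.sqrt(2 / m), size=(m, n)) for m, n in zip(sizes, sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
W0 = [w.copy() for w in W]  # snapshot of the initial configuration

def forward(x):
    acts = [x]
    for i, (w, bi) in enumerate(zip(W, b)):
        z = acts[-1] @ w + bi
        acts.append(np.maximum(z, 0.0) if i < len(W) - 1 else z)  # ReLU except output
    return acts

lr, batch = 0.05, 32
for step in range(2000):
    idx = rng.choice(len(X), batch, replace=False)
    acts = forward(X[idx])
    grad = 2 * (acts[-1] - y[idx]) / batch  # d(MSE)/d(output)
    for i in reversed(range(len(W))):
        gW, gb = acts[i].T @ grad, grad.sum(axis=0)
        if i > 0:
            grad = (grad @ W[i].T) * (acts[i] > 0)  # backprop through ReLU mask
        W[i] -= lr * gW
        b[i] -= lr * gb

final_loss = float(np.mean((forward(X)[-1] - y) ** 2))
# Per-layer relative deviation of the weights from initialization
deviations = [float(np.linalg.norm(w - w0) / np.linalg.norm(w0))
              for w, w0 in zip(W, W0)]
print(final_loss, deviations)
```

A full reproduction would track the distribution of per-link deviations over training epochs rather than a single per-layer norm, but the quantity measured is the same.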
Related papers
- Enhancing Convergence Speed with Feature-Enforcing Physics-Informed Neural Networks: Utilizing Boundary Conditions as Prior Knowledge for Faster Convergence [0.0]
This study introduces an accelerated training method for vanilla Physics-Informed Neural Networks (PINNs).
It addresses three factors that unbalance the loss function: the initial weight state of the neural network, the domain-to-boundary points ratio, and the loss weighting factor.
It is found that incorporating the weights generated in the first training phase into the structure of the neural network neutralizes the effects of these imbalance factors.
arXiv Detail & Related papers (2023-08-17T09:10:07Z) - Scaling and Resizing Symmetry in Feedforward Networks [0.0]
We show that the scaling property exhibited by physical systems at criticality is also present in untrained feedforward networks with random weights at the critical line.
We suggest an additional data-resizing symmetry, which is directly inherited from the scaling symmetry at criticality.
arXiv Detail & Related papers (2023-06-26T18:55:54Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
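Without-replacement minibatching for SGLD can be sketched as follows: each epoch shuffles the data once and visits every sample exactly once, instead of drawing minibatches independently. This is a hedged toy linear-regression illustration with hypothetical hyperparameters, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-regression data with known ground-truth weights
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr, temperature, batch = 0.01, 1e-4, 10

for epoch in range(50):
    perm = rng.permutation(len(X))  # one without-replacement pass per epoch
    for start in range(0, len(X), batch):
        idx = perm[start:start + batch]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch  # minibatch MSE gradient
        noise = rng.normal(size=w.shape)
        # Langevin step: gradient descent plus temperature-scaled Gaussian noise
        w += -lr * grad + np.sqrt(2 * lr * temperature) * noise

print(w)
```

The only difference from standard SGLD is that `idx` is taken from a per-epoch permutation rather than sampled with replacement.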
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Early Stage Convergence and Global Convergence of Training Mildly
Parameterized Neural Networks [3.148524502470734]
We show that the loss decreases by a significant amount in the early stage of training, and that this decrease is fast.
We use a microscopic analysis of the activation patterns for the neurons, which helps us derive more powerful lower bounds for the gradient.
arXiv Detail & Related papers (2022-06-05T09:56:50Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
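The weight-space averaging step can be illustrated as follows (hypothetical shapes and perturbations, not the paper's code): several late-phase copies of a layer's weights are collapsed back into a single model by taking their mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical late-phase ensemble: copies of one layer's weight matrix that
# have drifted slightly apart during the late stage of learning.
base = rng.normal(size=(4, 4))
members = [base + 0.01 * rng.normal(size=(4, 4)) for _ in range(5)]

# Spatial average in weight space yields a single model again.
averaged = np.mean(members, axis=0)
print(averaged.shape)
```

This only recovers a sensible model when the ensemble members stay close in weight space, which is what restricting the ensembling to a subset of weights in the late phase is meant to ensure.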
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Feature Purification: How Adversarial Training Performs Robust Deep
Learning [66.05472746340142]
We present a principle that we call Feature Purification: one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation via randomly initialized gradient descent indeed follows this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.