Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization
- URL: http://arxiv.org/abs/2012.14193v1
- Date: Mon, 28 Dec 2020 11:17:46 GMT
- Title: Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization
- Authors: Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, Krzysztof Geras
- Abstract summary: We show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training.
We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training.
- Score: 111.57403811375484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The early phase of training has been shown to be important in two ways for
deep neural networks. First, the degree of regularization in this phase
significantly impacts the final generalization. Second, it is accompanied by a
rapid change in the local loss curvature influenced by regularization choices.
Connecting these two findings, we show that stochastic gradient descent (SGD)
implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the
beginning of training. We argue it is an implicit regularizer in SGD by showing
that explicitly penalizing the trace of the FIM can significantly improve
generalization. We further show that the early value of the trace of the FIM
correlates strongly with the final generalization. We highlight that in the
absence of implicit or explicit regularization, the trace of the FIM can
increase to a large value early in training, which we refer to as catastrophic
Fisher explosion. Finally, to gain insight into the regularization effect of
penalizing the trace of the FIM, we show that 1) it limits memorization by
reducing the learning speed of examples with noisy labels more than that of the
clean examples, and 2) trajectories with a low initial trace of the FIM end in
flat minima, which are commonly associated with good generalization.
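Since the abstract describes the Fisher penalty only in prose, the following is a minimal sketch of what explicitly penalizing the trace of the FIM could look like in PyTorch. It uses the identity Tr(F) = E_x E_{y~p_theta(.|x)} ||grad_theta log p_theta(y|x)||^2, estimating the expectation on a mini-batch with labels sampled from the model's own predictive distribution. The helper names (fisher_trace_penalty, training_step) and the penalty weight kappa are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed setup) of an explicit Fisher-trace penalty in PyTorch.
import torch
import torch.nn.functional as F

def fisher_trace_penalty(model, inputs):
    """Mini-batch estimate of Tr(FIM) = E ||grad log p(y_hat | x)||^2,
    with y_hat sampled from the model's own output distribution."""
    logits = model(inputs)
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    nll = F.cross_entropy(logits, sampled)  # -log p(y_hat | x), batch mean
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True keeps the penalty differentiable w.r.t. the parameters.
    grads = torch.autograd.grad(nll, params, create_graph=True)
    return sum(g.pow(2).sum() for g in grads)

def training_step(model, optimizer, x, y, kappa=0.1):
    """One SGD step on cross-entropy plus kappa * Tr(FIM); kappa is a
    hypothetical hyperparameter, not a value from the paper."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + kappa * fisher_trace_penalty(model, x)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note the create_graph=True in autograd.grad: the gradient-norm penalty must itself remain differentiable so that backward() can propagate its gradient into the parameter update.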
Related papers
- Early Period of Training Impacts Out-of-Distribution Generalization [56.283944756315066]
We investigate the relationship between learning dynamics and OOD generalization during the early period of neural network training.
We show that selecting the number of trainable parameters at different times during training has a minuscule impact on ID results.
The absolute values of sharpness and trace of the Fisher Information at the initial period of training are not indicative of OOD generalization.
arXiv Detail & Related papers (2024-03-22T13:52:53Z)
- Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep, randomly initialized neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
arXiv Detail & Related papers (2021-06-07T16:29:59Z)
- Implicit bias of deep linear networks in the large learning rate phase [15.846533303963229]
We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that, depending on the separation conditions of the data, the gradient descent iterates will converge to a flatter minimum in the catapult phase.
arXiv Detail & Related papers (2020-11-25T06:50:30Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD converges along the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of a neural network.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in poor conditioning of the loss surface, even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)