Implicit bias of deep linear networks in the large learning rate phase
- URL: http://arxiv.org/abs/2011.12547v2
- Date: Wed, 16 Dec 2020 13:38:29 GMT
- Title: Implicit bias of deep linear networks in the large learning rate phase
- Authors: Wei Huang, Weitao Du, Richard Yi Da Xu, and Chunrui Liu
- Abstract summary: We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that depending on the separation conditions of the data, the gradient descent iterates will converge to a flatter minimum in the catapult phase.
- Score: 15.846533303963229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most theoretical studies explaining the regularization effect in deep
learning have focused only on gradient descent with a sufficiently small learning
rate or even gradient flow (infinitesimal learning rate). Such studies, however,
have neglected the reasonably large learning rates used in most
practical applications. In this work, we characterize the implicit bias effect
of deep linear networks for binary classification using the logistic loss in
the large learning rate regime, inspired by the seminal work by Lewkowycz et
al. [26] in a regression setting with squared loss. They found a learning rate
regime with a large stepsize named the catapult phase, where the loss grows at
the early stage of training and eventually converges to a minimum that is
flatter than those found in the small learning rate regime. We claim that
depending on the separation conditions of the data, the gradient descent iterates
will converge to a flatter minimum in the catapult phase. We rigorously prove
this claim under the assumption of degenerate data by overcoming the difficulty
of the non-constant Hessian of logistic loss and further characterize the
behavior of loss and Hessian for non-separable data. Finally, we demonstrate
that flatter minima in the space spanned by non-separable data along with the
learning rate in the catapult phase can lead to better generalization
empirically.
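To make the catapult picture concrete, here is a minimal NumPy sketch (not the authors' code) that trains a small deep linear network on the logistic loss with gradient descent at a small and at a large step size and records the loss curve; with a suitably large step the loss typically rises early in training before settling, which is the signature of the catapult phase described above. The random non-separable data, depth, width, initialization scale, and learning rates are illustrative assumptions, and the step sizes may need sweeping to land in the catapult regime rather than the divergent one.

import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data with labels in {-1, +1}; with n > d and
# random labels the data is typically non-separable.  All sizes and learning
# rates below are illustrative assumptions, not values from the paper.
n, d = 50, 10
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

depth, width = 3, 32
sizes = [d] + [width] * (depth - 1) + [1]


def init_weights(seed=1):
    """Random init with scale 1/sqrt(fan_in); fixed seed so every run starts identically."""
    g = np.random.default_rng(seed)
    return [g.normal(size=(a, b)) / np.sqrt(a) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(Ws):
    out = X
    for W in Ws:
        out = out @ W
    return out.ravel()                    # end-to-end linear predictor


def logistic_loss(Ws):
    margins = y * forward(Ws)
    return np.mean(np.logaddexp(0.0, -margins))


def grads(Ws):
    """Backpropagation through the product of weight matrices."""
    margins = y * forward(Ws)
    # d(loss)/d(output_i) = -y_i * sigmoid(-margin_i) / n, via tanh for stability
    s = -y * 0.5 * (1.0 - np.tanh(margins / 2.0)) / n
    acts = [X]
    for W in Ws[:-1]:
        acts.append(acts[-1] @ W)
    delta, gs = s[:, None], []
    for i in reversed(range(len(Ws))):
        gs.insert(0, acts[i].T @ delta)
        delta = delta @ Ws[i].T
    return gs


def train(lr, steps=200):
    Ws = init_weights()
    losses = []
    for _ in range(steps):
        Ws = [W - lr * G for W, G in zip(Ws, grads(Ws))]
        losses.append(logistic_loss(Ws))
    return losses


# Small step: the loss decreases monotonically.  Large ("catapult-regime")
# step: the loss typically spikes early before settling; sweep lr to locate
# the stable / catapult / divergent regimes for this particular data.
for lr in (0.01, 2.0):
    losses = train(lr)
    print(f"lr={lr}: first losses {np.round(losses[:5], 3)}, final loss {losses[-1]:.3f}")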
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a sufficiently large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Catapult Dynamics and Phase Transitions in Quadratic Nets [10.32543637637479]
We prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogeneous neural nets.
We show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large.
We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.
arXiv Detail & Related papers (2023-01-18T19:03:48Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Origin of Implicit Regularization in Stochastic Gradient Descent [22.802683068658897]
For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function.
We prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite.
We verify that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
arXiv Detail & Related papers (2021-01-28T18:32:14Z) - When does gradient descent with logistic loss find interpolating two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z) - The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
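As a concrete illustration of the max-margin result in the last entry above, here is a minimal NumPy sketch, not taken from any of the listed papers, in which gradient descent on the logistic loss over a hand-built linearly separable dataset makes the weight norm grow without bound while the normalized iterate drifts toward the hard-margin SVM direction, which is (1, 0) by construction. The dataset, step size, and iteration counts are assumptions chosen only for illustration; convergence in direction is logarithmically slow, so later iterates are only gradually closer.

import numpy as np

# Separable toy data written as z_i = y_i * x_i, so every constraint reads
# w . z_i > 0.  The two closest points (1, +-0.5) are symmetric about the
# first axis, so the hard-margin SVM direction (no bias term) is (1, 0).
Z = np.array([[1.0, 0.5],
              [1.0, -0.5],
              [2.0, 3.0],
              [3.0, -2.0]])
svm_direction = np.array([1.0, 0.0])


def grad(w):
    """Gradient of the empirical logistic loss (1/n) * sum_i log(1 + exp(-w . z_i))."""
    margins = Z @ w
    s = 0.5 * (1.0 - np.tanh(margins / 2.0))   # sigmoid(-margin), stable form
    return -(Z * s[:, None]).mean(axis=0)


w = np.array([0.3, -0.7])     # arbitrary initialization
lr = 1.0                      # any sufficiently small fixed step size works

for step in range(1, 100_001):
    w -= lr * grad(w)
    if step in (10, 100, 1_000, 10_000, 100_000):
        direction = w / np.linalg.norm(w)
        print(f"step {step:>6}: |w| = {np.linalg.norm(w):6.2f}, "
              f"direction = {np.round(direction, 4)}")

print("hard-margin SVM direction:", svm_direction)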