Implicit bias of deep linear networks in the large learning rate phase
- URL: http://arxiv.org/abs/2011.12547v2
- Date: Wed, 16 Dec 2020 13:38:29 GMT
- Title: Implicit bias of deep linear networks in the large learning rate phase
- Authors: Wei Huang, Weitao Du, Richard Yi Da Xu, and Chunrui Liu
- Abstract summary: We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that depending on the separation conditions of the data, the gradient descent iterates will converge to a flatter minimum in the catapult phase.
- Score: 15.846533303963229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most theoretical studies explaining the regularization effect in deep
learning have focused only on gradient descent with a sufficiently small learning
rate or even gradient flow (infinitesimal learning rate). Such studies, however,
have neglected the reasonably large learning rates used in most
practical applications. In this work, we characterize the implicit bias effect
of deep linear networks for binary classification using the logistic loss in
the large learning rate regime, inspired by the seminal work by Lewkowycz et
al. [26] in a regression setting with squared loss. They found a learning rate
regime with a large stepsize named the catapult phase, where the loss grows at
the early stage of training and eventually converges to a minimum that is
flatter than those found in the small learning rate regime. We claim that
depending on the separation conditions of the data, the gradient descent iterates
will converge to a flatter minimum in the catapult phase. We rigorously prove
this claim under the assumption of degenerate data by overcoming the difficulty
of the non-constant Hessian of logistic loss and further characterize the
behavior of loss and Hessian for non-separable data. Finally, we demonstrate
that flatter minima in the space spanned by non-separable data along with the
learning rate in the catapult phase can lead to better generalization
empirically.
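To make the catapult picture concrete, here is a minimal NumPy sketch (not the authors' code) that trains a small deep linear network on the logistic loss with gradient descent at a small and at a large step size and records the loss curve; with a suitably large step the loss typically rises early in training before settling, which is the signature of the catapult phase described above. The random non-separable data, depth, width, initialization scale, and learning rates are illustrative assumptions, and the step sizes may need sweeping to land in the catapult regime rather than the divergent one.

import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data with labels in {-1, +1}; with n > d and
# random labels the data is typically non-separable.  All sizes and learning
# rates below are illustrative assumptions, not values from the paper.
n, d = 50, 10
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

depth, width = 3, 32
sizes = [d] + [width] * (depth - 1) + [1]


def init_weights(seed=1):
    """Random init with scale 1/sqrt(fan_in); fixed seed so every run starts identically."""
    g = np.random.default_rng(seed)
    return [g.normal(size=(a, b)) / np.sqrt(a) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(Ws):
    out = X
    for W in Ws:
        out = out @ W
    return out.ravel()                    # end-to-end linear predictor


def logistic_loss(Ws):
    margins = y * forward(Ws)
    return np.mean(np.logaddexp(0.0, -margins))


def grads(Ws):
    """Backpropagation through the product of weight matrices."""
    margins = y * forward(Ws)
    # d(loss)/d(output_i) = -y_i * sigmoid(-margin_i) / n, via tanh for stability
    s = -y * 0.5 * (1.0 - np.tanh(margins / 2.0)) / n
    acts = [X]
    for W in Ws[:-1]:
        acts.append(acts[-1] @ W)
    delta, gs = s[:, None], []
    for i in reversed(range(len(Ws))):
        gs.insert(0, acts[i].T @ delta)
        delta = delta @ Ws[i].T
    return gs


def train(lr, steps=200):
    Ws = init_weights()
    losses = []
    for _ in range(steps):
        Ws = [W - lr * G for W, G in zip(Ws, grads(Ws))]
        losses.append(logistic_loss(Ws))
    return losses


# Small step: the loss decreases monotonically.  Large ("catapult-regime")
# step: the loss typically spikes early before settling; sweep lr to locate
# the stable / catapult / divergent regimes for this particular data.
for lr in (0.01, 2.0):
    losses = train(lr)
    print(f"lr={lr}: first losses {np.round(losses[:5], 3)}, final loss {losses[-1]:.3f}")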
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a sufficiently large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Catapult Dynamics and Phase Transitions in Quadratic Nets [10.32543637637479]
We prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogeneous neural nets.
We show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large.
We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.
arXiv Detail & Related papers (2023-01-18T19:03:48Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Origin of Implicit Regularization in Stochastic Gradient Descent [22.802683068658897]
For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function.
We prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite.
We verify that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
arXiv Detail & Related papers (2021-01-28T18:32:14Z) - When does gradient descent with logistic loss find interpolating two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z) - The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
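As a concrete illustration of the max-margin result in the last entry above, here is a minimal NumPy sketch, not taken from any of the listed papers, in which gradient descent on the logistic loss over a hand-built linearly separable dataset makes the weight norm grow without bound while the normalized iterate drifts toward the hard-margin SVM direction, which is (1, 0) by construction. The dataset, step size, and iteration counts are assumptions chosen only for illustration; convergence in direction is logarithmically slow, so later iterates are only gradually closer.

import numpy as np

# Separable toy data written as z_i = y_i * x_i, so every constraint reads
# w . z_i > 0.  The two closest points (1, +-0.5) are symmetric about the
# first axis, so the hard-margin SVM direction (no bias term) is (1, 0).
Z = np.array([[1.0, 0.5],
              [1.0, -0.5],
              [2.0, 3.0],
              [3.0, -2.0]])
svm_direction = np.array([1.0, 0.0])


def grad(w):
    """Gradient of the empirical logistic loss (1/n) * sum_i log(1 + exp(-w . z_i))."""
    margins = Z @ w
    s = 0.5 * (1.0 - np.tanh(margins / 2.0))   # sigmoid(-margin), stable form
    return -(Z * s[:, None]).mean(axis=0)


w = np.array([0.3, -0.7])     # arbitrary initialization
lr = 1.0                      # any sufficiently small fixed step size works

for step in range(1, 100_001):
    w -= lr * grad(w)
    if step in (10, 100, 1_000, 10_000, 100_000):
        direction = w / np.linalg.norm(w)
        print(f"step {step:>6}: |w| = {np.linalg.norm(w):6.2f}, "
              f"direction = {np.round(direction, 4)}")

print("hard-margin SVM direction:", svm_direction)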