Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy
- URL: http://arxiv.org/abs/2007.06738v1
- Date: Mon, 13 Jul 2020 23:49:53 GMT
- Title: Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy
- Authors: Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee,
Nathan Srebro, Daniel Soudry
- Abstract summary: We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
- Score: 71.25689267025244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide a detailed asymptotic study of gradient flow trajectories and
their implicit optimization bias when minimizing the exponential loss over
"diagonal linear networks". This is the simplest model displaying a transition
between "kernel" and non-kernel ("rich" or "active") regimes. We show how the
transition is controlled by the relationship between the initialization scale
and how accurately we minimize the training loss. Our results indicate that
some limit behaviors of gradient descent only kick in at ridiculous training
accuracies (well beyond $10^{-100}$). Moreover, the implicit bias at reasonable
initialization scales and training accuracies is more complex and not captured
by these limits.
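
To make the setting concrete, here is a minimal numerical sketch (not the authors' code) of the model described in the abstract: a depth-2 diagonal linear network beta = u*u - v*v trained by gradient descent on the exponential loss from the initialization u = v = alpha * ones. The toy dataset, learning rate, and the trick of stepping on the log of the training loss are illustrative choices; the transition to look for is that a small scale alpha together with a very small target loss concentrates the learned direction on the informative coordinate ("rich" regime), while a larger alpha or a modest target loss spreads it out ("kernel" regime).

```python
# Minimal sketch (illustrative, not the paper's experiments): gradient descent
# on the exponential loss over a depth-2 "diagonal linear network"
# beta = u*u - v*v, initialized at u = v = alpha * ones(d).
import numpy as np

rng = np.random.default_rng(0)

# Separable toy data: only coordinate 0 is informative, the rest is noise.
n, d = 20, 10
X = rng.normal(size=(n, d))
X[:, 0] = 2.0 * np.sign(rng.normal(size=n))
y = np.sign(X[:, 0])

def train(alpha, target_loss, lr=0.05, max_steps=200_000):
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)
    for _ in range(max_steps):
        beta = u * u - v * v
        losses = np.exp(-y * (X @ beta))          # per-example exponential loss
        total = losses.sum()
        if total < target_loss:
            break
        # Stepping on log(total loss) rescales time but follows essentially the
        # same path as the flow on the loss itself, and it reaches tiny loss
        # values in a practical number of iterations.
        p = losses / total
        g_beta = -(p * y) @ X                     # d log(loss) / d beta
        u, v = u - lr * 2 * u * g_beta, v + lr * 2 * v * g_beta
    beta = u * u - v * v
    return beta / np.linalg.norm(beta)

for alpha in (1.0, 0.01):
    for target in (1e-2, 1e-8):
        print(f"alpha={alpha:5.2f}  loss<{target:.0e}:",
              np.round(train(alpha, target), 2))
```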
Related papers
- Deep linear networks for regression are implicitly regularized towards flat minima [4.806579822134391]
Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one.
We show a lower bound on the sharpness of minimizers, which grows linearly with depth.
We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound.
arXiv Detail & Related papers (2024-05-22T08:58:51Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
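
As a rough illustration of the quantity analyzed in the leaky ReLU paper above, the sketch below (an assumed toy setup, not the authors' experiments) takes one large gradient step on a two-layer leaky-ReLU network with small random initialization and prints the singular values of the first-layer weight matrix before and after; the spectrum concentrating onto a few directions is the rank-reduction effect the summary refers to. All dimensions, the initialization scale, and the step size are made-up choices.

```python
# Toy sketch: two-layer leaky-ReLU network on high-dimensional data with
# logistic loss; print the spectrum of the first-layer weights W at init and
# after a single gradient step on W.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, gamma = 20, 200, 50, 0.5            # samples, input dim, width, leaky slope

X = rng.normal(size=(n, d)) / np.sqrt(d)     # nearly-orthogonal high-dimensional inputs
y = rng.choice([-1.0, 1.0], size=n)

W = 1e-4 * rng.normal(size=(m, d))           # small random initialization
a = rng.choice([-1.0, 1.0], size=m) / m      # fixed-scale second layer

def spectrum(M, k=5):
    return np.round(np.linalg.svd(M, compute_uv=False)[:k], 5)

print("singular values of W at init:      ", spectrum(W))

pre = X @ W.T                                 # (n, m) pre-activations
f = np.where(pre > 0, pre, gamma * pre) @ a   # network outputs
s = -y / (1.0 + np.exp(y * f))                # d(logistic loss)/d f, per sample
grad_W = ((s[:, None] * np.where(pre > 0, 1.0, gamma)) * a).T @ X / n

W1 = W - 5.0 * grad_W                         # one (large) gradient step on W
print("singular values of W after 1 step: ", spectrum(W1))
```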
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce an alternative one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
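
The weight-sharing mechanism behind the slimmable-networks entry above can be shown in a few lines. The sketch below illustrates only the generic idea (a layer whose sub-widths reuse the first rows of one shared weight matrix); the class name and dimensions are made up, and the paper's actual contribution is the contrastive self-supervised training recipe built on top of such networks.

```python
# Minimal sketch of a weight-sharing "slimmable" dense layer: every sub-network
# at width ratio r reuses the first r-fraction of the rows of one shared weight
# matrix, so pre-training the full network once yields several smaller networks.
import numpy as np

rng = np.random.default_rng(0)

class SlimmableLinear:
    """A dense layer whose sub-widths share the same underlying parameters."""

    def __init__(self, d_in, d_out):
        self.W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
        self.b = np.zeros(d_out)

    def forward(self, x, width_ratio=1.0):
        k = max(1, int(round(width_ratio * self.W.shape[0])))
        return x @ self.W[:k].T + self.b[:k]     # only the first k output units

layer = SlimmableLinear(d_in=64, d_out=128)
x = rng.normal(size=(8, 64))

for r in (1.0, 0.5, 0.25):
    print(f"width ratio {r:4.2f} -> output shape {layer.forward(x, r).shape}")
```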
- Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Cost Function Unrolling in Unsupervised Optical Flow [6.656273171776146]
This work focuses on the derivation of the Total Variation semi-norm commonly used in unsupervised cost functions.
We derive a differentiable proxy to the hard L1 smoothness constraint in a novel iterative scheme which we refer to as Cost Unrolling.
arXiv Detail & Related papers (2020-11-30T14:10:03Z) - Implicit bias of deep linear networks in the large learning rate phase [15.846533303963229]
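
For intuition about what a differentiable proxy for the hard L1/TV term in the Cost Unrolling entry above looks like, here is a generic sketch on a 1-D denoising problem: the absolute difference |t| is replaced by the reweighted quadratic t**2 / (|t_prev| + eps) and the resulting smooth problem is re-solved a few times. This is a standard iterative surrogate shown only for illustration; the paper derives its own unrolled proxy and applies it to unsupervised optical flow rather than 1-D denoising.

```python
# The hard total-variation (TV) penalty sum_i |u[i+1] - u[i]| is non-smooth at
# zero, which is what makes it awkward inside gradient-trained cost functions.
# One standard differentiable surrogate: iteratively reweighted quadratics.
import numpy as np

rng = np.random.default_rng(0)

# Noisy piecewise-constant signal, denoised with a TV-regularized least-squares cost.
truth = np.repeat([0.0, 1.0, 0.3], 40)
noisy = truth + 0.1 * rng.normal(size=truth.size)

lam, eps, N = 0.2, 1e-2, truth.size
u = noisy.copy()
idx = np.arange(N - 1)

for _ in range(15):                            # unrolled reweighting iterations
    w = 1.0 / (np.abs(np.diff(u)) + eps)       # weights frozen at the previous estimate
    # Smooth surrogate: 0.5*||u - noisy||^2 + 0.5*lam*sum_i w[i]*(u[i+1]-u[i])^2,
    # whose minimizer solves the linear system (I + lam * D^T diag(w) D) u = noisy.
    A = np.eye(N)
    A[idx, idx] += lam * w
    A[idx + 1, idx + 1] += lam * w
    A[idx, idx + 1] -= lam * w
    A[idx + 1, idx] -= lam * w
    u = np.linalg.solve(A, noisy)

print("TV of the noisy input:", round(float(np.abs(np.diff(noisy)).sum()), 2))
print("TV of the estimate:   ", round(float(np.abs(np.diff(u)).sum()), 2))
```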
- Implicit bias of deep linear networks in the large learning rate phase [15.846533303963229]
We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that, depending on the separation conditions of the data, the gradient descent iterates will converge to a flatter minimum in the catapult phase.
arXiv Detail & Related papers (2020-11-25T06:50:30Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate more stable and better-performing training of deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents (including all content) and is not responsible for any consequences of its use.