Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity
- URL: http://arxiv.org/abs/2106.09524v1
- Date: Thu, 17 Jun 2021 14:16:04 GMT
- Title: Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity
- Authors: Scott Pesme, Loucas Pillaud-Vivien and Nicolas Flammarion
- Abstract summary: We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
- Score: 24.428843425522107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the implicit bias of training algorithms is of crucial
importance in order to explain the success of overparametrised neural networks.
In this paper, we study the dynamics of stochastic gradient descent over
diagonal linear networks through its continuous time version, namely stochastic
gradient flow. We explicitly characterise the solution chosen by the stochastic
flow and prove that it always enjoys better generalisation properties than that
of gradient flow. Quite surprisingly, we show that the convergence speed of the
training loss controls the magnitude of the biasing effect: the slower the
convergence, the better the bias. To fully complete our analysis, we provide
convergence guarantees for the dynamics. We also give experimental results
which support our theoretical claims. Our findings highlight the fact that
structured noise can induce better generalisation and they help explain the
better performance of stochastic gradient descent over gradient descent
observed in practice.
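To make the setting concrete, the sketch below (illustrative only, not the authors' code) trains a diagonal linear network beta = u * u - v * v on an overparametrised sparse regression problem, once with full-batch gradient descent (a discretisation of gradient flow) and once with single-sample SGD (a discretisation of stochastic gradient flow). All hyperparameters are arbitrary assumptions; the paper's theory predicts the SGD solution lands closer to the sparse, low-l1-norm interpolator.

```python
# Minimal illustrative sketch (not the authors' code) of the setting studied in
# the paper: a diagonal linear network beta = u*u - v*v trained on a sparse,
# overparametrised regression problem with full-batch GD versus single-sample SGD.
# All hyperparameters are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 50, 3                       # n samples, d features, k-sparse signal
X = rng.normal(size=(n, d)) / np.sqrt(n)
beta_star = np.zeros(d)
beta_star[:k] = 1.0
y = X @ beta_star                         # noiseless labels, d > n (overparametrised)

def train(batch_size, steps=100_000, lr=0.05, alpha=0.1):
    # alpha is the initialisation scale of the network weights
    u = np.full(d, alpha)
    v = np.full(d, alpha)                 # beta = u*u - v*v starts at 0
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        r = X[idx] @ (u * u - v * v) - y[idx]     # residuals on the mini-batch
        g = X[idx].T @ r / batch_size             # gradient of the loss w.r.t. beta
        u -= lr * 2 * u * g                       # chain rule through beta = u*u - v*v
        v += lr * 2 * v * g
    return u * u - v * v

beta_gd = train(batch_size=n)   # full batch: deterministic gradient descent
beta_sgd = train(batch_size=1)  # single sample: stochastic gradient descent

for name, b in [("GD ", beta_gd), ("SGD", beta_sgd)]:
    print(f"{name} l1 norm = {np.linalg.norm(b, 1):.3f}, "
          f"error vs sparse target = {np.linalg.norm(b - beta_star):.3f}")
```

Varying the step size and initialisation scale in this sketch is one way to probe the abstract's claim that slower convergence of the training loss strengthens the bias.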
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T)/T^{1-\frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective at solving forward and inverse differential equation problems.
However, PINNs are prone to training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- On the Overlooked Structure of Stochastic Gradients [34.650998241703626]
We show that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and gradient noise caused by minibatch training usually do not exhibit power-law heavy tails.
Our work challenges the existing belief and provides novel insights on the structure of gradients in deep learning.
arXiv Detail & Related papers (2022-12-05T07:55:22Z)
- Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation [14.244787327283335]
We study the role of the label noise in the training dynamics of a quadratically parametrised model through its continuous time version.
Our findings highlight the fact that structured noise can induce better generalisation and help explain the better performance of such stochastic dynamics observed in practice.
arXiv Detail & Related papers (2022-06-20T15:24:42Z)
- On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
- Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit Bias towards Low Rank [1.9350867959464846]
In deep learning, gradient descent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z)
- A Random Matrix Theory Approach to Damping in Deep Learning [0.7614628596146599]
We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from the increased estimation noise.
We develop a novel random matrix theory based damping learner for second-order optimisers, inspired by linear shrinkage estimation.
arXiv Detail & Related papers (2020-11-15T18:19:42Z)
- Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks; a minimal sketch of the clipping rule appears after this list.
arXiv Detail & Related papers (2020-10-05T14:36:59Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
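As referenced in the clipping entry above, here is a generic sketch of the clipping rule (an assumed minimal implementation, not the referenced paper's code): the gradient is rescaled whenever its norm exceeds a threshold, which bounds the step length even when raw gradients explode.

```python
# Minimal sketch of clipped gradient descent (not the referenced paper's code):
# the step is g * min(1, clip / ||g||), so the effective step length is bounded
# even when the raw gradient is very large.
import numpy as np

def clipped_gd(grad_fn, theta0, lr=0.1, clip=1.0, steps=500):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        g = grad_fn(theta)
        norm = np.linalg.norm(g)
        if norm > clip:
            g = g * (clip / norm)        # rescale to the clipping threshold
        theta = theta - lr * g
    return theta

# Toy usage: a quartic objective (theta - 2)^4 whose gradient explodes far from
# the minimiser, where unclipped GD with the same step size would diverge.
grad = lambda theta: 4.0 * (theta - 2.0) ** 3
print(clipped_gd(grad, theta0=[10.0]))   # moves steadily towards 2.0
```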
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.