(S)GD over Diagonal Linear Networks: Implicit Regularisation, Large
Stepsizes and Edge of Stability
- URL: http://arxiv.org/abs/2302.08982v2
- Date: Wed, 25 Oct 2023 16:09:19 GMT
- Title: (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large
Stepsizes and Edge of Stability
- Authors: Mathieu Even, Scott Pesme, Suriya Gunasekar, Nicolas Flammarion
- Abstract summary: We investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks.
We show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD.
- Score: 29.65065635952276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the impact of stochasticity and large stepsizes
on the implicit regularisation of gradient descent (GD) and stochastic gradient
descent (SGD) over diagonal linear networks. We prove the convergence of GD and
SGD with macroscopic stepsizes in an overparametrised regression setting and
characterise their solutions through an implicit regularisation problem. Our
crisp characterisation leads to qualitative insights about the impact of
stochasticity and stepsizes on the recovered solution. Specifically, we show
that large stepsizes consistently benefit SGD for sparse regression problems,
while they can hinder the recovery of sparse solutions for GD. These effects
are magnified for stepsizes in a tight window just below the divergence
threshold, in the "edge of stability" regime. Our findings are supported by
experimental results.
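To make the setting concrete, the following is a minimal sketch (not the authors' code) of (S)GD on a diagonal linear network beta = u * v for an overparametrised sparse regression problem, run with a small and a large stepsize. The u * v parametrisation is the standard one for diagonal linear networks, but the problem sizes, stepsize values, iteration count and initialisation scale below are illustrative assumptions only.
```python
# Minimal sketch: (S)GD over a diagonal linear network beta = u * v for
# overparametrised sparse regression, comparing small vs. large stepsizes.
# Problem sizes, stepsizes and initialisation scale are illustrative, not
# taken from the paper; the divergence threshold depends on the instance.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 100, 5                      # samples, dimension, sparsity
X = rng.standard_normal((n, d)) / np.sqrt(n)
beta_star = np.zeros(d)
beta_star[:k] = 1.0
y = X @ beta_star                         # noiseless sparse regression targets

def run(stepsize, batch=None, iters=50_000, alpha=0.1):
    """GD (batch=None) or SGD (batch=1) on the diagonal network beta = u * v."""
    u = np.full(d, alpha)
    v = np.full(d, alpha)                 # small balanced initialisation
    for _ in range(iters):
        if batch is None:
            Xi, yi = X, y                 # full-batch gradient descent
        else:
            idx = rng.integers(n, size=batch)
            Xi, yi = X[idx], y[idx]       # stochastic mini-batch
        r = Xi @ (u * v) - yi             # residual of the linear predictor
        g = Xi.T @ r / len(yi)            # gradient w.r.t. beta = u * v
        u, v = u - stepsize * g * v, v - stepsize * g * u  # chain rule through u, v
    return u * v

for name, beta in [("GD  small step", run(0.05)),
                   ("GD  large step", run(0.8)),
                   ("SGD small step", run(0.05, batch=1)),
                   ("SGD large step", run(0.8, batch=1))]:
    err = np.linalg.norm(beta - beta_star)
    l1 = np.abs(beta).sum()
    print(f"{name}: recovery error {err:.3f}, l1 norm {l1:.2f}")
```
Comparing the recovery error and l1 norm of the returned solutions gives a rough, qualitative probe of the stepsize effects described in the abstract; the precise behaviour near the divergence threshold depends on the problem instance.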
Related papers
- A Precise Characterization of SGD Stability Using Loss Surface Geometry [8.942671556572073]
Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding.
Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates.
arXiv Detail & Related papers (2024-01-22T19:46:30Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly.
The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Stochastic gradient descent introduces an effective landscape-dependent
regularization favoring flat solutions [5.022507593837554]
Generalization is one of the most important problems in deep learning (DL).
There exist many low-loss solutions that fit the training data equally well.
The key question is which solution is more generalizable.
arXiv Detail & Related papers (2022-06-02T18:49:36Z) - Chaotic Regularization and Heavy-Tailed Limits for Deterministic
Gradient Descent [4.511923587827301]
Gradient descent (GD) can achieve improved generalization when its dynamics exhibit chaotic behavior.
In this study, we incorporate a chaotic component into GD in a controlled manner and introduce multiscale perturbed GD (MPGD).
MPGD is a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-05-23T14:47:55Z) - Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)
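As referenced in the MPGD entry above, here is a hypothetical sketch of a GD recursion augmented with perturbations generated by an independent chaotic dynamical system. The logistic-map perturbation, its coupling scale, and the quadratic objective are illustrative assumptions, not the construction from that paper.
```python
# Hypothetical sketch of the idea behind multiscale perturbed GD (MPGD):
# a plain GD recursion augmented with a perturbation driven by an
# independent chaotic dynamical system. The logistic map, coupling scale
# and test objective are assumptions for illustration only.
import numpy as np

def chaotic_gd(grad, x0, stepsize=0.1, noise_scale=0.01, iters=1000, r=3.99):
    x = np.asarray(x0, dtype=float)
    z = np.linspace(0.2, 0.8, x.size)     # per-coordinate state of the chaotic system
    for _ in range(iters):
        z = r * z * (1.0 - z)             # logistic map: evolves independently of x
        x = x - stepsize * grad(x) + noise_scale * (z - 0.5)  # perturbed GD step
    return x

# Usage: minimise a simple quadratic f(x) = 0.5 * ||x||^2, whose gradient is x
x_final = chaotic_gd(grad=lambda x: x, x0=np.ones(5))
print(x_final)
```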