Chaotic Regularization and Heavy-Tailed Limits for Deterministic
Gradient Descent
- URL: http://arxiv.org/abs/2205.11361v1
- Date: Mon, 23 May 2022 14:47:55 GMT
- Title: Chaotic Regularization and Heavy-Tailed Limits for Deterministic
Gradient Descent
- Authors: Soon Hoe Lim, Yijun Wan, Umut \c{S}im\c{s}ekli
- Abstract summary: gradient descent (GD) can achieve improved generalization when its dynamics exhibits a chaotic behavior.
In this study, we incorporate a chaotic component to GD in a controlled manner, and introduce multiscale perturbed GD (MPGD)
MPGD is a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system.
- Score: 4.511923587827301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have shown that gradient descent (GD) can achieve improved
generalization when its dynamics exhibits a chaotic behavior. However, to
obtain the desired effect, the step-size should be chosen sufficiently large, a
task which is problem dependent and can be difficult in practice. In this
study, we incorporate a chaotic component to GD in a controlled manner, and
introduce multiscale perturbed GD (MPGD), a novel optimization framework where
the GD recursion is augmented with chaotic perturbations that evolve via an
independent dynamical system. We analyze MPGD from three different angles: (i)
By building up on recent advances in rough paths theory, we show that, under
appropriate assumptions, as the step-size decreases, the MPGD recursion
converges weakly to a stochastic differential equation (SDE) driven by a
heavy-tailed L\'evy-stable process. (ii) By making connections to recently
developed generalization bounds for heavy-tailed processes, we derive a
generalization bound for the limiting SDE and relate the worst-case
generalization error over the trajectories of the process to the parameters of
MPGD. (iii) We analyze the implicit regularization effect brought by the
dynamical regularization and show that, in the weak perturbation regime, MPGD
introduces terms that penalize the Hessian of the loss function. Empirical
results are provided to demonstrate the advantages of MPGD.
Related papers
- Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS)
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large
Stepsizes and Edge of Stability [29.65065635952276]
We investigate the impact of gradientity and large stepsizes on the implicit regularisation of descent (GD) and gradient descent (SGD) over diagonal linear networks.
We show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD.
arXiv Detail & Related papers (2023-02-17T16:37:08Z) - OrthoReg: Improving Graph-regularized MLPs via Orthogonality
Regularization [66.30021126251725]
Graph Neural Networks (GNNs) are currently dominating in modeling graphstructure data.
Graph-regularized networks (GR-MLPs) implicitly inject the graph structure information into model weights, while their performance can hardly match that of GNNs in most tasks.
We show that GR-MLPs suffer from dimensional collapse, a phenomenon in which the largest a few eigenvalues dominate the embedding space.
We propose OrthoReg, a novel GR-MLP model to mitigate the dimensional collapse issue.
arXiv Detail & Related papers (2023-01-31T21:20:48Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and gradient descent (SGD) to train SNNs, for both of which we develop consistent excess bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a emphheavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
momentum-based accelerated variants of gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.