Second-order regression models exhibit progressive sharpening to the
edge of stability
- URL: http://arxiv.org/abs/2210.04860v1
- Date: Mon, 10 Oct 2022 17:21:20 GMT
- Title: Second-order regression models exhibit progressive sharpening to the
edge of stability
- Authors: Atish Agarwala, Fabian Pedregosa, and Jeffrey Pennington
- Abstract summary: We show that for quadratic objectives in two dimensions, a second-order regression model exhibits progressive sharpening towards a value that differs slightly from the edge of stability.
In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network.
- Score: 30.92413051155244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies of gradient descent with large step sizes have shown that
there is often a regime with an initial increase in the largest eigenvalue of
the loss Hessian (progressive sharpening), followed by a stabilization of the
eigenvalue near the maximum value which allows convergence (edge of stability).
These phenomena are intrinsically non-linear and do not happen for models in
the constant Neural Tangent Kernel (NTK) regime, for which the predictive
function is approximately linear in the parameters. As such, we consider the
next simplest class of predictive models, namely those that are quadratic in
the parameters, which we call second-order regression models. For quadratic
objectives in two dimensions, we prove that this second-order regression model
exhibits progressive sharpening of the NTK eigenvalue towards a value that
differs slightly from the edge of stability, which we explicitly compute. In
higher dimensions, the model generically shows similar behavior, even without
the specific structure of a neural network, suggesting that progressive
sharpening and edge-of-stability behavior aren't unique features of neural
networks, and could be a more general property of discrete learning algorithms
in high-dimensional non-linear models.
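As a concrete illustration of the setup described in the abstract, the sketch below runs full-batch gradient descent on a generic second-order regression model, i.e. a model whose predictions are quadratic in its parameters, and tracks the largest eigenvalue of the loss Hessian ("sharpness") against the stability threshold 2/eta. The sizes, random scales, and step size are illustrative assumptions rather than the paper's experimental settings; with a sufficiently large step size the sharpness typically rises during training (progressive sharpening) and then levels off near 2/eta instead of growing past it (edge of stability). If a particular random draw destabilizes or never approaches the threshold, the step size and the strength of the quadratic term are the knobs to adjust.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic second-order regression model: predictions quadratic in the parameters,
#   f_i(theta) = b_i . theta + 0.5 * theta^T A_i theta,
# with random b_i and random symmetric A_i (no neural-network structure).
n, p = 20, 40                                    # illustrative sizes
B = rng.standard_normal((n, p)) / np.sqrt(p)
A = rng.standard_normal((n, p, p)) / np.sqrt(p)
A = 0.25 * (A + np.transpose(A, (0, 2, 1)))      # symmetrize and set the quadratic strength
y = rng.standard_normal(n)

def predictions(theta):
    return B @ theta + 0.5 * np.einsum("i,nij,j->n", theta, A, theta)

def loss_grad_hess(theta):
    r = predictions(theta) - y                   # residuals, shape (n,)
    J = B + np.einsum("nij,j->ni", A, theta)     # Jacobian d f_i / d theta
    loss = 0.5 * np.mean(r ** 2)
    grad = J.T @ r / n
    hess = (J.T @ J + np.einsum("n,nij->ij", r, A)) / n   # Gauss-Newton part + curvature part
    return loss, grad, hess

eta = 2.0                                        # large step size; 2/eta is the stability threshold
theta = np.zeros(p)
for step in range(401):
    loss, grad, hess = loss_grad_hess(theta)
    if step % 50 == 0:
        sharpness = np.linalg.eigvalsh(hess)[-1]  # largest Hessian eigenvalue
        print(f"step {step:3d}  loss {loss:.4f}  sharpness {sharpness:.3f}  2/eta {2 / eta:.3f}")
    theta -= eta * grad
```

In the constant-NTK (linearized) regime mentioned in the abstract, the quadratic terms A_i would be absent, the Jacobian and hence the sharpness would stay fixed at their initial values, and no sharpening could occur.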
Related papers
- Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit [1.7597525104451157]
The empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE).
Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs).
We analyze the fixed-point locations of the ODEs and their stability, unveiling several interesting findings.
arXiv Detail & Related papers (2024-06-11T03:07:41Z)
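The last step in the entry above, locating fixed points of low-dimensional ODEs and classifying their stability, can be illustrated generically. The two-variable vector field below is a toy example, not the ODEs derived in the paper: fixed points are found by Newton's method and labelled stable or unstable from the real parts of the Jacobian's eigenvalues.

```python
import numpy as np

# Toy 2-D ODE system (illustrative only, not the ODEs from the paper):
#   dx/dt = x - x**3 - 0.5*y,   dy/dt = x - y
def f(z):
    x, y = z
    return np.array([x - x ** 3 - 0.5 * y, x - y])

def jacobian(z, eps=1e-6):
    """Finite-difference Jacobian of the vector field at z."""
    J = np.zeros((2, 2))
    for k in range(2):
        e = np.zeros(2)
        e[k] = eps
        J[:, k] = (f(z + e) - f(z - e)) / (2 * eps)
    return J

def find_fixed_point(z0, steps=100):
    """Newton iteration for f(z) = 0 starting from z0."""
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = z - np.linalg.solve(jacobian(z), f(z))
    return z

for z0 in [(-1.0, -1.0), (0.1, 0.1), (1.0, 1.0)]:
    zs = find_fixed_point(z0)
    eig = np.linalg.eigvals(jacobian(zs))
    label = "stable" if np.all(eig.real < 0) else "unstable"
    print(f"fixed point {np.round(zs, 3)}  eigenvalues {np.round(eig, 3)}  {label}")
```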
- Neural Abstractions [72.42530499990028]
We present a novel method for the safety verification of nonlinear dynamical models that uses neural networks to represent abstractions of their dynamics.
We demonstrate that our approach performs comparably to the mature tool Flow* on existing benchmark nonlinear models.
arXiv Detail & Related papers (2023-01-27T12:38:09Z)
- Linear Stability Hypothesis and Rank Stratification for Nonlinear Models [3.0041514772139166]
We propose a rank stratification for general nonlinear models to uncover a model rank as an "effective size of parameters".
Based on these results, the model rank of a target function predicts the minimal training data size needed for its successful recovery.
arXiv Detail & Related papers (2022-11-21T16:27:25Z)
- Improving Generalization via Uncertainty Driven Perturbations [107.45752065285821]
We consider uncertainty-driven perturbations (UDP) of the training data points.
Unlike loss-driven perturbations, uncertainty-guided perturbations do not cross the decision boundary.
We show that UDP is guaranteed to achieve the maximum-margin decision boundary on linear models.
arXiv Detail & Related papers (2022-02-11T16:22:08Z)
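A minimal sketch of the contrast drawn in the entry above, for a fixed linear classifier; the logistic model, the entropy-based uncertainty measure, and the step size are illustrative assumptions, not the paper's exact procedure. Ascending the loss keeps pushing a correctly classified point past the decision boundary, whereas ascending predictive uncertainty stalls at the boundary, where the entropy is maximal.

```python
import numpy as np

# Fixed linear classifier p(y=1|x) = sigmoid(w.x + b)  (illustrative weights).
w, b = np.array([2.0, -1.0]), 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entropy(x):
    p = np.clip(sigmoid(w @ x + b), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def loss(x, y):
    # Cross-entropy of the true label y in {0, 1}.
    p = np.clip(sigmoid(w @ x + b), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def num_grad(fn, x, eps=1e-5):
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return g

x, y = np.array([1.0, 0.5]), 1        # a correctly classified training point
step = 0.2

x_loss, x_unc = x.copy(), x.copy()
for _ in range(30):
    # Loss-driven: ascend the loss -> marches across the decision boundary.
    x_loss += step * num_grad(lambda z: loss(z, y), x_loss)
    # Uncertainty-driven: ascend predictive entropy -> stalls at the boundary.
    x_unc += step * num_grad(entropy, x_unc)

for name, z in [("loss-driven", x_loss), ("uncertainty-driven", x_unc)]:
    margin = w @ z + b                # sign gives the predicted side of the boundary
    print(f"{name:20s}  w.x+b = {margin:+.3f}  p(y=1) = {sigmoid(margin):.3f}")
```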
- Time varying regression with hidden linear dynamics [74.9914602730208]
We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system.
Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates.
arXiv Detail & Related papers (2021-12-29T23:37:06Z)
- Stabilizing Equilibrium Models by Jacobian Regularization [151.78151873928027]
Deep equilibrium networks (DEQs) are a new class of models that eschews traditional depth in favor of finding the fixed point of a single nonlinear layer.
We propose a regularization scheme for DEQ models that explicitly regularizes the Jacobian of the fixed-point update equations to stabilize the learning of equilibrium models.
We show that this regularization adds only minimal computational cost, significantly stabilizes the fixed-point convergence in both forward and backward passes, and scales well to high-dimensional, realistic domains.
arXiv Detail & Related papers (2021-06-28T00:14:11Z)
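A minimal sketch of the idea in the entry above, with a toy tanh equilibrium layer and sizes and penalty weight chosen for illustration (not the paper's implementation): solve the fixed point z* = tanh(W z* + U x) by iteration, then penalize the squared Frobenius norm of the Jacobian of the update at z*, estimated with a Hutchinson-style stochastic estimator of the kind commonly used for this regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy equilibrium layer f(z, x) = tanh(W z + U x)  (illustrative sizes and weights).
dz, dx = 8, 4
W = 0.1 * rng.standard_normal((dz, dz))       # small enough that the iteration contracts
U = rng.standard_normal((dz, dx))
x = rng.standard_normal(dx)

def f(z):
    return np.tanh(W @ z + U @ x)

z = np.zeros(dz)                              # fixed-point solve by simple iteration
for _ in range(200):
    z = f(z)

# Jacobian of the update at the fixed point: for a tanh layer, J = diag(1 - z*^2) W.
J = (1.0 - z ** 2)[:, None] * W

def jac_frob_sq_estimate(n_samples=64):
    """Hutchinson-style estimate of ||J||_F^2 = E_eps ||J eps||^2 with eps ~ N(0, I)."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(dz)
        total += np.sum((J @ eps) ** 2)
    return total / n_samples

gamma = 0.1                                   # regularization weight (an assumption)
penalty = gamma * jac_frob_sq_estimate()      # would be added to the training loss
print(f"estimated penalty {penalty:.4f}   exact {gamma * np.sum(J ** 2):.4f}")
```

In a real DEQ the Jacobian is never formed explicitly; the stochastic estimate only needs Jacobian-vector products, which is what keeps the penalty cheap.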
- Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and a significant reduction in memory consumption.
They can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)
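To make the ill-posedness concern in the entry above concrete, here is a small sketch using one simple sufficient condition for well-posedness (a crude stand-in for the paper's sharper non-Euclidean contraction conditions, which are not reproduced here): if the activation is 1-Lipschitz and the infinity-norm of W is below 1, the map z -> relu(W z + U x + b) is a contraction in the infinity norm, so the equilibrium exists, is unique, and fixed-point iteration converges. The sizes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit layer: find z with z = relu(W z + U x + b).  Illustrative sizes/weights.
dz, dx = 6, 3
W = rng.uniform(-0.5, 0.5, size=(dz, dz))
W *= 0.9 / np.abs(W).sum(axis=1).max()        # rescale so that ||W||_inf = 0.9 < 1
U = rng.standard_normal((dz, dx))
b = rng.standard_normal(dz)
x = rng.standard_normal(dx)

def relu(v):
    return np.maximum(v, 0.0)

# With a 1-Lipschitz activation and ||W||_inf < 1, z -> relu(W z + U x + b) is a
# contraction in the infinity norm, so Picard iteration converges to the unique equilibrium.
print("||W||_inf =", np.abs(W).sum(axis=1).max())

z = np.zeros(dz)
for k in range(300):
    z_new = relu(W @ z + U @ x + b)
    if np.max(np.abs(z_new - z)) < 1e-9:
        break
    z = z_new

residual = np.max(np.abs(relu(W @ z + U @ x + b) - z))
print(f"stopped after {k + 1} iterations, residual {residual:.2e}")
```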
- The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z)
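Since the Neural Tangent Kernel also appears in the main abstract above (the constant-NTK regime is precisely where progressive sharpening cannot occur), here is a small sketch of what the empirical NTK of a finite-width two-layer ReLU network looks like; the width, data, and initialization are arbitrary illustrative choices. Each kernel entry is the inner product of parameter gradients at two inputs, and in the infinite-width limit this kernel stays approximately constant during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer network f(x) = (1/sqrt(m)) * a . relu(W x)  (illustrative sizes).
d, m, n = 5, 512, 8
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
X = rng.standard_normal((n, d)) / np.sqrt(d)

def param_grad(x):
    """Gradient of f(x) with respect to (W, a), flattened."""
    h = W @ x                                   # pre-activations, shape (m,)
    act = np.maximum(h, 0.0)                    # ReLU
    g_a = act / np.sqrt(m)                      # df/da
    g_W = ((a * (h > 0)) / np.sqrt(m))[:, None] * x[None, :]   # df/dW
    return np.concatenate([g_W.ravel(), g_a])

# Empirical NTK Gram matrix: K[i, j] = grad_theta f(x_i) . grad_theta f(x_j).
G = np.stack([param_grad(xi) for xi in X])
K = G @ G.T

print("empirical NTK eigenvalues:", np.round(np.linalg.eigvalsh(K)[::-1], 3))
```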
- A Convex Parameterization of Robust Recurrent Neural Networks [3.2872586139884623]
Recurrent neural networks (RNNs) are a class of nonlinear dynamical systems often used to model sequence-to-sequence maps.
We formulate convex sets of RNNs with stability and robustness guarantees.
arXiv Detail & Related papers (2020-04-11T03:12:42Z)
- Dimension Independent Generalization Error by Stochastic Gradient Descent [12.474236773219067]
We present a theory on the generalization error of stochastic gradient descent (SGD) solutions for both convex and locally convex loss functions.
We show that the generalization error either does not depend on the dimension $p$ or depends on $p$ only through a low effective dimension, up to a logarithmic factor.
arXiv Detail & Related papers (2020-03-25T03:08:41Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
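The kernel-to-rich transition in the last entry can be sketched with a depth-2 "diagonal" model beta = w_plus**2 - w_minus**2, the kind of simple overparametrized model studied in this line of work; the data, initialization scales, learning rate, and step count below are illustrative assumptions. Larger initialization tends to end near the minimum-L2-norm interpolant (kernel regime), while smaller initialization tends toward the sparse generating vector (rich regime).

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined sparse regression: n observations, d > n features, sparse target.
n, d = 20, 60
X = rng.standard_normal((n, d)) / np.sqrt(n)
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true

def train_diagonal_net(alpha, eta=0.05, steps=20000):
    """Gradient descent on the depth-2 diagonal model beta = w_plus**2 - w_minus**2,
    started from the uniform initialization w_plus = w_minus = alpha (an assumption)."""
    w_plus = np.full(d, alpha)
    w_minus = np.full(d, alpha)
    for _ in range(steps):
        beta = w_plus ** 2 - w_minus ** 2
        g_beta = X.T @ (X @ beta - y) / n             # dL/dbeta for L = 0.5 * mean sq. residual
        w_plus = w_plus - eta * 2.0 * w_plus * g_beta   # chain rule through the squares
        w_minus = w_minus + eta * 2.0 * w_minus * g_beta
    return w_plus ** 2 - w_minus ** 2

beta_l2 = X.T @ np.linalg.solve(X @ X.T, y)           # minimum-L2-norm interpolant ("kernel" solution)

for alpha in (2.0, 0.01):
    beta = train_diagonal_net(alpha)
    print(f"alpha = {alpha:5.2f}   "
          f"distance to min-L2 interpolant {np.linalg.norm(beta - beta_l2):6.3f}   "
          f"distance to sparse target {np.linalg.norm(beta - beta_true):6.3f}")
```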