A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks
- URL: http://arxiv.org/abs/2206.02001v4
- Date: Wed, 12 Jun 2024 15:15:37 GMT
- Title: A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks
- Authors: Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, Anthony Yezzi
- Abstract summary: We discover restrained numerical instabilities in current training practices for deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDEs), and analyzing the gradient descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect.
- Score: 12.355137704908042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD) and its variants. We show that numerical error (on the order of the smallest floating point bit, and thus the most extreme or limiting numerical perturbation induced by floating point arithmetic in training deep nets) can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDEs), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomenon, in which the stable step size predicted by classical theory is exceeded while the optimization continues to reduce the loss and still converges. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular, the role of regularization and the dependence on network complexity.
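To make two of the central quantities concrete (the classical stability threshold on the learning rate, and the amplification of a perturbation on the order of the smallest floating point bit), here is a minimal sketch on a quadratic loss. It assumes nothing about the paper's CNN experiments.

```python
# A minimal sketch, not the paper's CNN setup: gradient descent on the
# quadratic loss L(w) = 0.5 * w^T H w, whose classical stability condition
# is lr < 2 / lambda_max(H). We perturb the initial weights by one ULP
# (the smallest floating point increment) and track the trajectory gap.
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 10.0])            # Hessian with lambda_max = 10
w0 = rng.standard_normal(2)
w0_pert = np.nextafter(w0, np.inf)  # one-ULP perturbation per coordinate

def run_gd(w, lr, steps=200):
    for _ in range(steps):
        w = w - lr * (H @ w)        # gradient of 0.5 * w^T H w is H @ w
    return w

for lr in (0.19, 0.21):             # just below / above 2 / lambda_max = 0.2
    gap = np.linalg.norm(run_gd(w0, lr) - run_gd(w0_pert, lr))
    print(f"lr={lr}: trajectory gap after 200 steps = {gap:.3e}")
```

Just below the threshold the one-ULP gap decays to zero; just above it, the gap is amplified by many orders of magnitude. On this linear toy problem the instability grows unboundedly, whereas the paper's point is that the non-linear dynamics of CNN training restrain it, which is the mechanism the paper links to the EoS.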
Related papers
- Physics-informed deep learning and compressive collocation for high-dimensional diffusion-reaction equations: practical existence theory and numerics [5.380276949049726]
We develop and analyze an efficient high-dimensional partial differential equation (PDE) solver based on deep learning (DL).
We show, both theoretically and numerically, that it can compete with a novel stable and accurate compressive spectral collocation method.
arXiv Detail & Related papers (2024-06-03T17:16:11Z)
- Stable Neural Stochastic Differential Equations in Analyzing Irregular Time Series Data [3.686808512438363]
Irregular sampling intervals and missing values in real-world time series data present challenges for conventional methods.
We propose three stable classes of Neural SDEs: Langevin-type SDE, Linear Noise SDE, and Geometric SDE.
Our results demonstrate the efficacy of the proposed method in handling real-world irregular time series data.
arXiv Detail & Related papers (2024-02-22T22:00:03Z)
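A minimal Euler-Maruyama discretization in the spirit of the entry above; the MLP drift, the constant diffusion sigma, and the time grid are illustrative stand-ins, not the parameterizations of the paper's three classes.

```python
# A minimal Euler-Maruyama sketch of a Langevin-type neural SDE,
# dz_t = f_theta(z_t) dt + sigma dW_t, for irregularly sampled times.
import torch

class Drift(torch.nn.Module):
    def __init__(self, dim=4, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, dim))

    def forward(self, z):
        return self.net(z)

def euler_maruyama(drift, z0, t_grid, sigma=0.1):
    """Integrate dz = drift(z) dt + sigma dW on a (possibly irregular) grid."""
    z, path = z0, [z0]
    for dt in t_grid.diff():                       # irregular steps are fine
        z = z + drift(z) * dt + sigma * torch.sqrt(dt) * torch.randn_like(z)
        path.append(z)
    return torch.stack(path)

z0 = torch.zeros(8, 4)                             # batch of 8 latent states
t_grid = torch.tensor([0.0, 0.1, 0.25, 0.3, 0.7])  # irregular sampling times
print(euler_maruyama(Drift(), z0, t_grid).shape)   # torch.Size([5, 8, 4])
```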
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between machine learning and equilibrium statistical physics.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
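As promised above, a minimal sketch of SGLD with without-replacement minibatching: each epoch shuffles the data once and walks through disjoint minibatches. The linear regression loss, step size, and temperature are illustrative assumptions, not the paper's Fokker-Planck setting.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)

def grad_minibatch(w, idx):
    # gradient of the mean squared error on the minibatch
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

def sgld_epoch(w, lr=1e-3, temperature=1e-4, batch=100):
    order = rng.permutation(len(X))       # without-replacement minibatching
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        noise = np.sqrt(2.0 * lr * temperature) * rng.standard_normal(w.shape)
        w = w - lr * grad_minibatch(w, idx) + noise
    return w

w = np.zeros(10)
for _ in range(50):
    w = sgld_epoch(w)   # Langevin noise keeps w sampling around the minimum
```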
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
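A minimal numerical illustration of the regime the entry above studies: constant-stepsize GD on logistic loss over linearly separable data, with a step size far above the classical 2/sharpness threshold at initialization. The data, step size, and iteration count are assumptions for illustration; the loss typically oscillates early yet still trends downward.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 2.0])
X = rng.standard_normal((200, 2))
m0 = X @ w_star
keep = np.abs(m0) > 0.3                    # enforce a positive margin
X, y = X[keep], np.sign(m0[keep])          # linearly separable by w_star

def sigmoid(t):
    # numerically stable logistic function, branching on the sign of t
    out = np.empty_like(t)
    pos = t >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-t[pos]))
    et = np.exp(t[~pos])
    out[~pos] = et / (1.0 + et)
    return out

def loss_and_grad(w):
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))        # stable log(1 + e^-m)
    grad = -(X.T @ (y * sigmoid(-margins))) / len(y)
    return loss, grad

# sharpness at w = 0 is lambda_max(X^T X) / (4n); lr = 50 is far above 2/sharpness
w, lr = np.zeros(2), 50.0
for t in range(2000):
    loss, grad = loss_and_grad(w)
    w -= lr * grad
    if t % 500 == 0:
        print(t, loss)
print("direction:", w / np.linalg.norm(w))  # expected to align with max margin
```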
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective at solving forward and inverse differential equation problems.
However, PINNs can fail to train when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process (a minimal sketch of the implicit step follows this entry).
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
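A minimal sketch contrasting an explicit gradient step with the implicit (backward-Euler) step that ISGD is built on, with a stiff quadratic standing in for a PINN loss. On a quadratic the implicit equation can be solved exactly; in practice it is solved approximately.

```python
import numpy as np

H = np.diag([1.0, 100.0])          # stiff quadratic loss 0.5 * w^T H w
lr = 0.05                          # explicit GD requires lr < 2/100 = 0.02

def explicit_step(w):
    return w - lr * (H @ w)

def implicit_step(w):
    # implicit step w+ = w - lr * H @ w+  <=>  (I + lr * H) w+ = w
    return np.linalg.solve(np.eye(2) + lr * H, w)

we = wi = np.array([1.0, 1.0])
for _ in range(100):
    we, wi = explicit_step(we), implicit_step(wi)
print(np.linalg.norm(we), np.linalg.norm(wi))  # explicit blows up, implicit -> 0
```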
- Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
During training, DNNs suffer from either infinite or zero gradients due to the non-differentiable discretization function.
arXiv Detail & Related papers (2023-02-07T10:51:53Z)
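For context on the zero/infinite-gradient problem named above, here is the standard straight-through estimator workaround; this is a common baseline remedy, not the paper's Ricci-flow-based method.

```python
import torch

def quantize_ste(w, levels=2**4):
    # forward: low-precision rounding; backward: identity (straight-through)
    wq = torch.round(w * levels) / levels
    return w + (wq - w).detach()

w = torch.randn(5, requires_grad=True)
quantize_ste(w).sum().backward()
print(w.grad)   # all ones: the non-differentiable round() is bypassed
```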
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points, where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z)
- The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
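For context on the "modified loss" in the entry above: the classical backward-error-analysis result for plain GD with step size $\eta$ says a discrete GD step follows, to higher order, the gradient flow of a modified loss. The formula below is that classical GD result, not the paper's SGD-specific expression, which additionally regularizes the velocity.

```latex
% Classical modified loss for plain gradient descent with step size \eta:
% a GD step tracks the gradient flow of \tilde{L} to O(\eta^2).
\tilde{L}(\theta) = L(\theta) + \frac{\eta}{4}\,\bigl\|\nabla L(\theta)\bigr\|^{2}
```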
- Accurate and Reliable Forecasting using Stochastic Differential Equations [48.21369419647511]
It is critical yet challenging for deep learning models to properly characterize uncertainty that is pervasive in real-world environments.
This paper develops SDE-HNN to characterize the interaction between the predictive mean and variance of heteroscedastic neural networks (HNNs) for accurate and reliable regression.
Experiments on the challenging datasets show that our method significantly outperforms the state-of-the-art baselines in terms of both predictive performance and uncertainty quantification.
arXiv Detail & Related papers (2021-03-28T04:18:11Z)