Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond
- URL: http://arxiv.org/abs/2305.13064v1
- Date: Mon, 22 May 2023 14:27:27 GMT
- Title: Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond
- Authors: Itai Kreisler, Mor Shpigel Nacson, Daniel Soudry, Yair Carmon
- Abstract summary: Recent research shows that when Gradient Descent is applied to neural networks, the loss almost never decreases monotonically.
Instead, the loss oscillates as gradient descent converges to its "Edge of Stability" (EoS).
- Score: 30.545436106324203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research shows that when Gradient Descent (GD) is applied to neural
networks, the loss almost never decreases monotonically. Instead, the loss
oscillates as gradient descent converges to its "Edge of Stability" (EoS).
Here, we find a quantity that does decrease monotonically throughout GD
training: the sharpness attained by the gradient flow solution (GFS), i.e., the
solution that would be obtained if, from now until convergence, we train with
an infinitesimal step size. Theoretically, we analyze scalar neural networks
with the squared loss, perhaps the simplest setting where the EoS phenomena
still occur. In this model, we prove that the GFS sharpness decreases
monotonically. Using this result, we characterize settings where GD provably
converges to the EoS in scalar networks. Empirically, we show that GD
monotonically decreases the GFS sharpness in a squared regression model as well
as practical neural network architectures.
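To make the central quantity concrete, below is a minimal sketch (not the authors' code) of the setting the abstract describes, specialized to a depth-2 scalar network f(x) = u*v*x trained on a single example (x, y) = (1, 1) with loss L(u, v) = 0.5*(u*v - y)^2. In this toy model gradient flow conserves u^2 - v^2 and the sharpness at any global minimum (u*v = y) equals u^2 + v^2, so the GFS sharpness from the current iterate has the closed form sqrt((u^2 - v^2)^2 + 4*y^2). The loss convention, step size, and initialization are arbitrary choices meant to place GD in the EoS regime; the paper's analysis covers deeper scalar networks and more general experiments.

```python
# Minimal sketch (assumed setup, not the paper's code): GD on a depth-2 scalar
# network with a single sample (x, y) = (1, 1) and loss L(u, v) = 0.5*(u*v - y)**2.
import math

y = 1.0           # target for the single training example
eta = 0.55        # step size; the EoS stability threshold is 2 / eta ~= 3.64
u, v = 2.0, 0.55  # unbalanced initialization -> initial GFS sharpness ~= 4.2

def loss(u, v):
    return 0.5 * (u * v - y) ** 2

def gfs_sharpness(u, v):
    # Sharpness of the minimum gradient flow would reach from (u, v):
    # gradient flow preserves c = u^2 - v^2, the limit point satisfies u*v = y,
    # and the top Hessian eigenvalue there is u^2 + v^2 = sqrt(c^2 + 4*y^2).
    c = u * u - v * v
    return math.sqrt(c * c + 4.0 * y * y)

print(f"{'step':>4} {'loss':>10} {'GFS sharpness':>14}   (2/eta = {2.0 / eta:.3f})")
for t in range(30):
    print(f"{t:>4} {loss(u, v):>10.6f} {gfs_sharpness(u, v):>14.6f}")
    r = u * v - y                              # residual on the single sample
    u, v = u - eta * v * r, v - eta * u * r    # one gradient descent step
```

Running this prints a loss column that is not monotone (it rises for several steps before decaying), while the GFS-sharpness column decreases monotonically toward the stability threshold 2/eta, mirroring the behavior the abstract describes for scalar networks.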
Related papers
- Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks [0.0]
We study the gradient descent dynamics of a depth-2 linear neural network with a single input and output.
We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize.
arXiv Detail & Related papers (2025-01-15T20:43:36Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when the fixed step size is too large.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Law of Balance and Stationary Distribution of Stochastic Gradient Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
arXiv Detail & Related papers (2023-08-13T03:13:03Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result already holds for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - The Dynamics of Gradient Descent for Overparametrized Neural Networks [19.11271777632797]
We show that the dynamics of neural network weights under GD converge to a point which is close to the minimum norm solution.
To illustrate the application of this result, we show that GD converges to a function that generalizes well.
arXiv Detail & Related papers (2021-05-13T22:20:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.