Related papers: Convergence of gradient descent for learning linear neural networks

Convergence of gradient descent for learning linear neural networks

URL: http://arxiv.org/abs/2108.02040v1
Date: Wed, 4 Aug 2021 13:10:30 GMT
Title: Convergence of gradient descent for learning linear neural networks
Authors: Gabin Maxime Nguegnang, Holger Rauhut, Ulrich Terstiege
Abstract summary: We show that gradient descent converges to a critical point of the loss function, i.e., the square loss in this article. In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold matrices of some fixed rank.
Score: 2.209921757303168
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the step sizes gradient descent converges to a critical point of the loss function, i.e., the square loss in this article. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold matrices of some fixed rank, where the rank cannot be determined a priori.

Related papers

Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks [0.0]
We prove directional convergence of network parameters of fixed width leaky ReLU two-layer neural networks optimized by gradient descent with exponential loss.<n>As an application, we demonstrate that benign overfitting occurs with high probability in sub-Gaussian mixture models.
arXiv Detail & Related papers (2025-05-22T04:11:58Z)
Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks [0.0]
We study the gradient descent dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize.
arXiv Detail & Related papers (2025-01-15T20:43:36Z)
On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z)
Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning [1.4050802766699084]
We consider the scenario of supervised learning in Deep Learning (DL) networks. We choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network.
arXiv Detail & Related papers (2023-11-27T02:12:02Z)
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two. For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
Convergence of gradient descent for deep neural networks [7.360807642941713]
gradient descent has been one of main drivers of the "deep learning revolution" This article presents a new criterion for convergence of gradient descent to a global minimum.
arXiv Detail & Related papers (2022-03-30T17:01:14Z)
Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of gradient descent algorithm for a class of one hidden layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z)
Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature. This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to global minimum. We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z)
When does gradient descent with logistic loss find interpolating two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z)
Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins [92.7662890047311]
We show that gradient descent finds halfspaces with classification error $tilde O(mathsfOPT1/2) + varepsilon$ in $mathrmpoly(d,1/varepsilon)$ time and sample complexity.
arXiv Detail & Related papers (2020-10-01T16:48:33Z)
On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with the descent optimization maneuvers through surfaces of positive curvatures at the start and end of training, and close to zero curvatures in between. In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.