Gradient Flow Equations for Deep Linear Neural Networks: A Survey from a Network Perspective
- URL: http://arxiv.org/abs/2511.10362v1
- Date: Fri, 14 Nov 2025 01:47:23 GMT
- Title: Gradient Flow Equations for Deep Linear Neural Networks: A Survey from a Network Perspective
- Authors: Joel Wendin, Claudio Altafini
- Abstract summary: The paper surveys recent progress in understanding the dynamics and loss landscape of the gradient flow equations associated with deep linear neural networks. The loss landscape is characterized by infinitely many global minima and saddle points, both strict and nonstrict, but lacks local minima and maxima. The adjacency matrix representation used in the paper highlights the existence of a quotient space structure.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paper surveys recent progress in understanding the dynamics and loss landscape of the gradient flow equations associated with deep linear neural networks, i.e., the gradient descent training dynamics (in the limit when the step size goes to 0) of deep neural networks without activation functions and subject to quadratic loss functions. When formulated in terms of the adjacency matrix of the neural network, as we do in the paper, these gradient flow equations form a class of converging matrix ODEs which is nilpotent, polynomial, isospectral, and with conservation laws. The loss landscape is described in detail. It is characterized by infinitely many global minima and saddle points, both strict and nonstrict, but lacks local minima and maxima. The loss function itself is a positive semidefinite Lyapunov function for the gradient flow, and its level sets are unbounded invariant sets of critical points, with critical values that correspond to the number of singular values of the input-output data learnt by the gradient along a certain trajectory. The adjacency matrix representation we use in the paper makes it possible to highlight the existence of a quotient space structure in which each critical value of the loss function is represented only once, while all other critical points with the same critical value belong to the fiber associated with the quotient space. It also makes it easy to determine stable and unstable submanifolds at the saddle points, even when the Hessian fails to identify them.
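As a rough illustration of the objects the abstract refers to, the sketch below integrates the gradient flow of a two-layer linear network with a small forward-Euler step and tracks one conserved quantity. Several things here are assumptions and are not taken from the paper: the two-layer layerwise parameterization (the paper works with an adjacency matrix formulation), the function name `simulate_gradient_flow`, the toy data, and the choice of `W1 W1^T - W2^T W2` as the conservation law, which is a standard identity for two-layer linear networks under quadratic loss. Forward Euler only approximates the flow, so the conserved quantity drifts slightly.

```python
import numpy as np


def simulate_gradient_flow(X, Y, hidden=3, step=1e-3, n_steps=20000, seed=0):
    """Forward-Euler approximation of gradient flow for Y ~ W2 @ W1 @ X.

    Loss: 0.5 * ||W2 @ W1 @ X - Y||_F^2 (quadratic loss, no activations).
    Returns the final weights, the loss history, and the drift of the
    quantity W1 @ W1.T - W2.T @ W2, which the exact flow keeps constant.
    """
    rng = np.random.default_rng(seed)
    # Small random initialisation (an illustrative choice).
    W1 = 0.1 * rng.standard_normal((hidden, X.shape[0]))
    W2 = 0.1 * rng.standard_normal((Y.shape[0], hidden))

    conserved_start = W1 @ W1.T - W2.T @ W2
    losses = np.empty(n_steps)

    for k in range(n_steps):
        residual = W2 @ W1 @ X - Y
        losses[k] = 0.5 * np.sum(residual ** 2)
        # Gradients of the loss with respect to each layer.
        grad_W1 = W2.T @ residual @ X.T
        grad_W2 = residual @ X.T @ W1.T
        # Euler step: W <- W - step * grad approximates dW/dt = -grad.
        W1 = W1 - step * grad_W1
        W2 = W2 - step * grad_W2

    conserved_end = W1 @ W1.T - W2.T @ W2
    drift = np.max(np.abs(conserved_end - conserved_start))
    return W1, W2, losses, drift


if __name__ == "__main__":
    # Toy input-output data with singular values 2 and 1 (hypothetical).
    X = np.eye(3)
    Y = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    W1, W2, losses, drift = simulate_gradient_flow(X, Y)
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
    print("drift of conserved quantity:", drift)
```

Plotting `losses` against the step index is one way to look for the plateaus typically associated with trajectories passing near saddle points before each singular value of the data is learnt.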
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Gradient descent provably escapes saddle points in the training of shallow ReLU networks [6.458742319938318]
We prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements.
Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks, we show that gradient descent escapes most saddle points.
arXiv Detail & Related papers (2022-08-03T14:08:52Z) - Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs [19.401271427657395]
The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution.
This article presents the gradient flow dynamics of shallow ReLU networks for the mean squared error at small initialisation.
arXiv Detail & Related papers (2022-06-02T09:01:25Z) - Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks [7.090165638014331]
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function.
We show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity.
arXiv Detail & Related papers (2022-04-14T22:50:28Z) - Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems [91.3755431537592]
We consider a control system of the form $\dot x = \sum_{i=1}^{l} F_i(x) u_i$, with linear dependence in the controls.
We use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points.
arXiv Detail & Related papers (2021-10-24T08:57:46Z) - Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear Classification [3.158346511479111]
We propose a class of straight-through estimators (STEs) with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions.
We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum.
arXiv Detail & Related papers (2020-11-23T07:50:09Z) - Piecewise linear activations substantially shape the loss surfaces of neural networks [95.73230376153872]
This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, which are defined as the local minima with higher empirical risks than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z) - Ill-Posedness and Optimization Geometry for Nonlinear Neural Network Training [4.7210697296108926]
We show that the nonlinear activation functions used in the network construction play a critical role in classifying stationary points of the loss landscape.
For shallow dense networks, the nonlinear activation function determines the Hessian nullspace in the vicinity of global minima.
We extend these results to deep dense neural networks, showing that the last activation function plays an important role in classifying stationary points.
arXiv Detail & Related papers (2020-02-07T16:33:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.