On Convergence of Training Loss Without Reaching Stationary Points
- URL: http://arxiv.org/abs/2110.06256v1
- Date: Tue, 12 Oct 2021 18:12:23 GMT
- Title: On Convergence of Training Loss Without Reaching Stationary Points
- Authors: Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie
- Abstract summary: We show that Neural Network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
- Score: 62.41370821014218
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: It is a well-known fact that nonconvex optimization is computationally
intractable in the worst case. As a result, theoretical analysis of
optimization algorithms such as gradient descent often focuses on local
convergence to stationary points where the gradient norm is zero or negligible.
In this work, we examine the disconnect between the existing theoretical
analysis of gradient-based algorithms and actual practice. Specifically, we
provide numerical evidence that in large-scale neural network training, such as
in ImageNet, ResNet, and WT103 + TransformerXL models, the Neural Network
weight variables do not converge to stationary points where the gradient of the
loss function vanishes. Remarkably, however, we observe that while weights do
not converge to stationary points, the value of the loss function converges.
Inspired by this observation, we propose a new perspective based on ergodic
theory of dynamical systems. We prove convergence of the distribution of weight
values to an approximate invariant measure (without smoothness assumptions)
that explains this phenomenon. We further discuss how this perspective can
better align the theory with empirical observations.
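The observation above suggests a simple diagnostic. The sketch below, a minimal illustration assuming PyTorch and a small synthetic classification problem (it is not the paper's ImageNet/ResNet/TransformerXL setup, and all hyperparameters are illustrative), tracks the two quantities the abstract contrasts: the training loss and the full-batch gradient norm at the current iterate.

```python
# Minimal sketch (not the paper's experimental setup): run SGD and log the
# training loss alongside the full-batch gradient norm. The paper's observation
# is that at large scale the loss converges while the gradient norm does not
# vanish; whether this toy run reproduces that depends on the model and data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)                     # synthetic inputs (illustrative)
y = torch.randint(0, 2, (512,))              # synthetic labels (illustrative)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2001):
    idx = torch.randint(0, 512, (64,))       # minibatch indices
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 200 == 0:
        # Full-batch loss and gradient norm at the current weights.
        model.zero_grad()
        full_loss = loss_fn(model(X), y)
        full_loss.backward()
        gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        print(f"step {step:4d}  loss {full_loss.item():.4f}  grad norm {gnorm.item():.4f}")
```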
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a sufficiently large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient is All You Need? [0.0]
In this paper we provide a novel analytical perspective on the theoretical understanding of learning algorithms by interpreting consensus-based optimization (CBO) as a gradient-based method.
Our results prove the intrinsic power of CBO to alleviate the complexities of the nonlocal landscape function (a minimal CBO update sketch appears after this list).
arXiv Detail & Related papers (2023-06-16T11:30:55Z) - Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients during training because the discretization functions involved are non-differentiable (the zero-gradient case is illustrated in a sketch after this list).
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z) - A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework based on the numerical analysis of partial differential equations (PDEs), and analyzing the gradient-descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect (a basic step-size/curvature stability criterion is illustrated in a sketch after this list).
arXiv Detail & Related papers (2022-06-04T14:54:05Z) - The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program [26.143558180103334]
We study non-convex subgradient flows for two-layer ReLU networks from a convex implicit geometry and duality perspective.
We show that we can identify the problem of non-convex subgradient descent via a primal-dual correspondence.
arXiv Detail & Related papers (2021-10-13T04:17:08Z) - Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze convergence on the objective functions induced by Dropout and Dropconnect when applying them to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the dropout probability, and the width of the NN.
arXiv Detail & Related papers (2020-12-01T19:02:37Z) - On dissipative symplectic integration with applications to gradient-based optimization [77.34726150561087]
We propose a geometric framework in which discretizations can be realized systematically.
We show that a generalization of symplectic integrators to nonconservative, and in particular dissipative, Hamiltonian systems is able to preserve rates of convergence up to a controlled error (a minimal dissipative-Hamiltonian discretization sketch appears after this list).
arXiv Detail & Related papers (2020-04-15T00:36:49Z)
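For the consensus-based optimization (CBO) entry above, the following is a minimal sketch of a standard CBO particle update (a Boltzmann-weighted consensus point plus distance-scaled exploration noise). It is offered as background on what CBO computes, not as the cited paper's specific scheme or analysis; the test objective and all constants are illustrative assumptions.

```python
# Minimal CBO sketch (illustrative; not the cited paper's exact scheme).
# Particles drift toward a Boltzmann-weighted consensus point of the ensemble
# and add noise scaled by their distance to that point.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Rastrigin objective in 2-D (assumed nonconvex test function, minimum at 0).
    return np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0, axis=-1)

N, d = 100, 2
X = rng.uniform(-4.0, 4.0, size=(N, d))        # particle positions
lam, sigma, alpha, dt = 1.0, 0.7, 50.0, 0.05    # drift, noise, temperature, step

for _ in range(400):
    w = np.exp(-alpha * (f(X) - f(X).min()))    # stabilized Boltzmann weights
    consensus = (w[:, None] * X).sum(axis=0) / w.sum()
    diff = X - consensus
    noise = rng.standard_normal((N, d)) * np.linalg.norm(diff, axis=1, keepdims=True)
    X = X - lam * diff * dt + sigma * np.sqrt(dt) * noise

print("consensus point:", consensus, " f(consensus):", float(f(consensus)))
```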
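For the discretized-networks entry above, the zero-gradient problem it mentions can be reproduced in a few lines: a sign/rounding quantizer has zero derivative almost everywhere, so no gradient flows through it. The sketch below shows this in PyTorch together with the straight-through estimator, a generic workaround that is not the Ricci-flow approach of the cited paper.

```python
# Zero gradients through a discretization function, and the generic
# straight-through estimator (STE) workaround. Not the cited paper's method.
import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)

# Naive quantization: torch.sign has zero derivative almost everywhere,
# so the gradient reaching x is identically zero.
torch.sign(x).sum().backward()
print("gradient through sign:", x.grad)       # all zeros

# STE: quantize in the forward pass, pass the gradient through unchanged
# in the backward pass.
x.grad = None
x_q = x + (torch.sign(x) - x).detach()        # forward: sign(x); backward: identity
x_q.sum().backward()
print("gradient with STE:    ", x.grad)       # all ones
```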
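The step-size effects discussed in the large-learning-rate and edge-of-stability entries above can be illustrated, under drastic simplification, on a one-dimensional quadratic: gradient descent contracts when the step size is below 2 divided by the curvature, oscillates at the threshold, and diverges beyond it. This is only the classical linear-stability criterion, not the analysis of either cited paper; the values below are illustrative.

```python
# Linear-stability sketch (illustrative; not either cited paper's analysis).
# Gradient descent on f(x) = 0.5 * lam * x**2 iterates x <- (1 - eta * lam) * x,
# so it converges iff |1 - eta * lam| < 1, i.e. iff eta < 2 / lam.
lam = 4.0                      # curvature of the quadratic (assumed)
for eta in (0.4, 0.5, 0.6):    # below, at, and above the 2 / lam = 0.5 threshold
    x = 1.0
    for _ in range(50):
        x -= eta * lam * x     # gradient step
    print(f"eta = {eta:.1f}  (eta * lam = {eta * lam:.1f})  x after 50 steps = {x:.3e}")
```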
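For the dissipative symplectic integration entry above, one concrete way to discretize a dissipative Hamiltonian system is sketched below: damp the momentum, apply a gradient kick, then drift. With this splitting the update coincides with heavy-ball momentum whose momentum factor is exp(-gamma * h). This is a standard observation given as an illustration, not necessarily the exact integrator analyzed in the cited paper; the objective and constants are illustrative.

```python
# Sketch: a conformal-symplectic-style discretization of the dissipative system
#   q' = p,   p' = -grad f(q) - gamma * p
# applied to f(q) = q**2. The update matches heavy-ball momentum with factor
# exp(-gamma * h). Illustrative constants; not necessarily the cited paper's scheme.
import math

def grad_f(q):
    return 2.0 * q             # gradient of f(q) = q**2 (assumed objective)

gamma, h = 1.0, 0.1            # damping coefficient and step size (assumed)
q, p = 1.0, 0.0
for _ in range(200):
    p = math.exp(-gamma * h) * p - h * grad_f(q)   # damp momentum, then gradient kick
    q = q + h * p                                   # drift
print(f"q after 200 steps: {q:.3e}  (momentum factor {math.exp(-gamma * h):.3f})")
```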