On Convergence of Training Loss Without Reaching Stationary Points
- URL: http://arxiv.org/abs/2110.06256v1
- Date: Tue, 12 Oct 2021 18:12:23 GMT
- Title: On Convergence of Training Loss Without Reaching Stationary Points
- Authors: Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie
- Abstract summary: We show that Neural Network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
- Score: 62.41370821014218
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: It is a well-known fact that nonconvex optimization is computationally
intractable in the worst case. As a result, theoretical analysis of
optimization algorithms such as gradient descent often focuses on local
convergence to stationary points where the gradient norm is zero or negligible.
In this work, we examine the disconnect between the existing theoretical
analysis of gradient-based algorithms and actual practice. Specifically, we
provide numerical evidence that in large-scale neural network training, such as
in ImageNet, ResNet, and WT103 + TransformerXL models, the Neural Network
weight variables do not converge to stationary points where the gradient of the
loss function vanishes. Remarkably, however, we observe that while weights do
not converge to stationary points, the value of the loss function converges.
Inspired by this observation, we propose a new perspective based on ergodic
theory of dynamical systems. We prove convergence of the distribution of weight
values to an approximate invariant measure (without smoothness assumptions)
that explains this phenomenon. We further discuss how this perspective can
better align the theory with empirical observations.
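The observation above suggests a simple diagnostic. The sketch below, a minimal illustration assuming PyTorch and a small synthetic classification problem (it is not the paper's ImageNet/ResNet/TransformerXL setup, and all hyperparameters are illustrative), tracks the two quantities the abstract contrasts: the training loss and the full-batch gradient norm at the current iterate.

```python
# Minimal sketch (not the paper's experimental setup): run SGD and log the
# training loss alongside the full-batch gradient norm. The paper's observation
# is that at large scale the loss converges while the gradient norm does not
# vanish; whether this toy run reproduces that depends on the model and data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)                     # synthetic inputs (illustrative)
y = torch.randint(0, 2, (512,))              # synthetic labels (illustrative)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2001):
    idx = torch.randint(0, 512, (64,))       # minibatch indices
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 200 == 0:
        # Full-batch loss and gradient norm at the current weights.
        model.zero_grad()
        full_loss = loss_fn(model(X), y)
        full_loss.backward()
        gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        print(f"step {step:4d}  loss {full_loss.item():.4f}  grad norm {gnorm.item():.4f}")
```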
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a sufficiently large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient is All You Need? [0.0]
In this paper we provide a novel analytical perspective on the theoretical understanding of learning algorithms by interpreting consensus-based optimization (CBO) as a gradient-based method.
Our results prove the intrinsic power of CBO to alleviate the complexities of the nonlocal landscape function (a minimal CBO update sketch appears after this list).
arXiv Detail & Related papers (2023-06-16T11:30:55Z) - Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients during training because the discretization functions involved are non-differentiable (the zero-gradient case is illustrated in a sketch after this list).
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z) - A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework based on the numerical analysis of partial differential equations (PDEs), and analyzing the gradient-descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect (a basic step-size/curvature stability criterion is illustrated in a sketch after this list).
arXiv Detail & Related papers (2022-06-04T14:54:05Z) - The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program [26.143558180103334]
We study non-convex subgradient flows for two-layer ReLU networks from a convex implicit geometry and duality perspective.
We show that we can identify the problem of non-convex subgradient descent via a primal-dual correspondence.
arXiv Detail & Related papers (2021-10-13T04:17:08Z) - Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze convergence on the objective functions induced by Dropout and Dropconnect when applying them to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the dropout probability, and the width of the NN.
arXiv Detail & Related papers (2020-12-01T19:02:37Z) - On dissipative symplectic integration with applications to gradient-based optimization [77.34726150561087]
We propose a geometric framework in which discretizations can be realized systematically.
We show that a generalization of symplectic integrators to nonconservative, and in particular dissipative, Hamiltonian systems is able to preserve rates of convergence up to a controlled error (a minimal dissipative-Hamiltonian discretization sketch appears after this list).
arXiv Detail & Related papers (2020-04-15T00:36:49Z)
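For the consensus-based optimization (CBO) entry above, the following is a minimal sketch of a standard CBO particle update (a Boltzmann-weighted consensus point plus distance-scaled exploration noise). It is offered as background on what CBO computes, not as the cited paper's specific scheme or analysis; the test objective and all constants are illustrative assumptions.

```python
# Minimal CBO sketch (illustrative; not the cited paper's exact scheme).
# Particles drift toward a Boltzmann-weighted consensus point of the ensemble
# and add noise scaled by their distance to that point.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Rastrigin objective in 2-D (assumed nonconvex test function, minimum at 0).
    return np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0, axis=-1)

N, d = 100, 2
X = rng.uniform(-4.0, 4.0, size=(N, d))        # particle positions
lam, sigma, alpha, dt = 1.0, 0.7, 50.0, 0.05    # drift, noise, temperature, step

for _ in range(400):
    w = np.exp(-alpha * (f(X) - f(X).min()))    # stabilized Boltzmann weights
    consensus = (w[:, None] * X).sum(axis=0) / w.sum()
    diff = X - consensus
    noise = rng.standard_normal((N, d)) * np.linalg.norm(diff, axis=1, keepdims=True)
    X = X - lam * diff * dt + sigma * np.sqrt(dt) * noise

print("consensus point:", consensus, " f(consensus):", float(f(consensus)))
```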
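For the discretized-networks entry above, the zero-gradient problem it mentions can be reproduced in a few lines: a sign/rounding quantizer has zero derivative almost everywhere, so no gradient flows through it. The sketch below shows this in PyTorch together with the straight-through estimator, a generic workaround that is not the Ricci-flow approach of the cited paper.

```python
# Zero gradients through a discretization function, and the generic
# straight-through estimator (STE) workaround. Not the cited paper's method.
import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)

# Naive quantization: torch.sign has zero derivative almost everywhere,
# so the gradient reaching x is identically zero.
torch.sign(x).sum().backward()
print("gradient through sign:", x.grad)       # all zeros

# STE: quantize in the forward pass, pass the gradient through unchanged
# in the backward pass.
x.grad = None
x_q = x + (torch.sign(x) - x).detach()        # forward: sign(x); backward: identity
x_q.sum().backward()
print("gradient with STE:    ", x.grad)       # all ones
```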
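The step-size effects discussed in the large-learning-rate and edge-of-stability entries above can be illustrated, under drastic simplification, on a one-dimensional quadratic: gradient descent contracts when the step size is below 2 divided by the curvature, oscillates at the threshold, and diverges beyond it. This is only the classical linear-stability criterion, not the analysis of either cited paper; the values below are illustrative.

```python
# Linear-stability sketch (illustrative; not either cited paper's analysis).
# Gradient descent on f(x) = 0.5 * lam * x**2 iterates x <- (1 - eta * lam) * x,
# so it converges iff |1 - eta * lam| < 1, i.e. iff eta < 2 / lam.
lam = 4.0                      # curvature of the quadratic (assumed)
for eta in (0.4, 0.5, 0.6):    # below, at, and above the 2 / lam = 0.5 threshold
    x = 1.0
    for _ in range(50):
        x -= eta * lam * x     # gradient step
    print(f"eta = {eta:.1f}  (eta * lam = {eta * lam:.1f})  x after 50 steps = {x:.3e}")
```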
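For the dissipative symplectic integration entry above, one concrete way to discretize a dissipative Hamiltonian system is sketched below: damp the momentum, apply a gradient kick, then drift. With this splitting the update coincides with heavy-ball momentum whose momentum factor is exp(-gamma * h). This is a standard observation given as an illustration, not necessarily the exact integrator analyzed in the cited paper; the objective and constants are illustrative.

```python
# Sketch: a conformal-symplectic-style discretization of the dissipative system
#   q' = p,   p' = -grad f(q) - gamma * p
# applied to f(q) = q**2. The update matches heavy-ball momentum with factor
# exp(-gamma * h). Illustrative constants; not necessarily the cited paper's scheme.
import math

def grad_f(q):
    return 2.0 * q             # gradient of f(q) = q**2 (assumed objective)

gamma, h = 1.0, 0.1            # damping coefficient and step size (assumed)
q, p = 1.0, 0.0
for _ in range(200):
    p = math.exp(-gamma * h) * p - h * grad_f(q)   # damp momentum, then gradient kick
    q = q + h * p                                   # drift
print(f"q after 200 steps: {q:.3e}  (momentum factor {math.exp(-gamma * h):.3f})")
```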