A convergence result of a continuous model of deep learning via Łojasiewicz--Simon inequality
- URL: http://arxiv.org/abs/2311.15365v2
- Date: Sun, 14 Apr 2024 05:39:11 GMT
- Title: A convergence result of a continuous model of deep learning via Łojasiewicz--Simon inequality
- Authors: Noboru Isobe,
- Abstract summary: This focuses on a Wasserstein-type flow, which represents a process process of a continuous model of a Deep Neural Network (DNN)
First, we establish the existence ar for an average loss of the model underL2regularization.
We show existence of slope of of the loss as time as time as time of flow- optimization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study focuses on a Wasserstein-type gradient flow, which represents an optimization process of a continuous model of a Deep Neural Network (DNN). First, we establish the existence of a minimizer for an average loss of the model under $L^2$-regularization. Subsequently, we show the existence of a curve of maximal slope of the loss. Our main result is the convergence of flow to a critical point of the loss as time goes to infinity. An essential aspect of proving this result involves the establishment of the \L{}ojasiewicz--Simon gradient inequality for the loss. We derive this inequality by assuming the analyticity of NNs and loss functions. Our proofs offer a new approach for analyzing the asymptotic behavior of Wasserstein-type gradient flows for nonconvex functionals.
Related papers
- SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures [3.3123773366516645]
We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions.<n>We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an critical value.
arXiv Detail & Related papers (2025-05-14T17:15:11Z) - Non-asymptotic convergence analysis of the stochastic gradient
Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with
applications to training of ReLU neural networks [8.058385158111207]
We provide a non-asymptotic analysis of the convergence of the gradient Hamiltonian Monte Carlo to a target measure in Wasserstein-1 and Wasserstein-2 distance.
To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
arXiv Detail & Related papers (2024-09-25T17:21:09Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Critical Points and Convergence Analysis of Generative Deep Linear
Networks Trained with Bures-Wasserstein Loss [2.294014185517203]
We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance.
We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices.
We establish convergence results for gradient flow using a smooth perturbative version of the loss and convergence results for finite step size gradient descent.
arXiv Detail & Related papers (2023-03-06T10:56:14Z) - Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - Asymptotic consistency of the WSINDy algorithm in the limit of continuum
data [0.0]
We study the consistency of the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy)
We provide a mathematically rigorous explanation for the observed robustness to noise of weak-form equation learning.
arXiv Detail & Related papers (2022-11-29T07:49:34Z) - Global Convergence of Over-parameterized Deep Equilibrium Models [52.65330015267245]
A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection.
Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation.
We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.
arXiv Detail & Related papers (2022-05-27T08:00:13Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that Neural Network weight variables do not converge to stationary points where the gradient the loss function vanishes.
We propose a new perspective based on ergodic theory dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Finite Sample Analysis of Minimax Offline Reinforcement Learning:
Completeness, Fast Rates and First-Order Efficiency [83.02999769628593]
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning.
We show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions.
We present the first finite-sample result with first-order efficiency in non-tabular environments.
arXiv Detail & Related papers (2021-02-05T03:20:39Z) - Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze the convergence on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the rate probability, and the width of the NN.
arXiv Detail & Related papers (2020-12-01T19:02:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.