A proof of convergence for gradient descent in the training of
artificial neural networks for constant target functions
- URL: http://arxiv.org/abs/2102.09924v1
- Date: Fri, 19 Feb 2021 13:33:03 GMT
- Title: A proof of convergence for gradient descent in the training of
artificial neural networks for constant target functions
- Authors: Patrick Cheridito, Arnulf Jentzen, Adrian Riekert, Florian Rossmannek
- Abstract summary: We show that the risk function of the gradient descent method does indeed converge to zero.
A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters.
- Score: 3.4792548480344254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient descent optimization algorithms are the standard ingredients that
are used to train artificial neural networks (ANNs). Even though a huge number
of numerical simulations indicate that gradient descent optimization methods do
indeed converge in the training of ANNs, to this day there is no rigorous
theoretical analysis which proves (or disproves) this conjecture. In
particular, even in the case of the most basic variant of gradient descent
optimization algorithms, the plain vanilla gradient descent method, it remains
an open problem to prove or disprove the conjecture that gradient descent
converges in the training of ANNs. In this article we solve this problem in the
special situation where the target function under consideration is a constant
function. More specifically, in the case of constant target functions we prove
in the training of rectified fully-connected feedforward ANNs with one hidden
layer that the risk function of the gradient descent method does indeed
converge to zero. Our mathematical analysis strongly exploits the property that
the rectifier function is the activation function used in the considered ANNs.
A key contribution of this work is to explicitly specify a Lyapunov function
for the gradient flow system of the ANN parameters. This Lyapunov function is
the central tool in our convergence proof of the gradient descent method.
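The convergence statement above can be illustrated numerically. The following is a minimal sketch in plain NumPy (a toy setup of our own, not the paper's framework or proof): vanilla gradient descent on a one-hidden-layer ReLU network fitting a constant target, with the empirical risk driven toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (ours, for illustration): fit the constant target f(x) = c
# with a one-hidden-layer ReLU network trained by plain gradient descent.
d, width, c = 1, 16, 2.0                 # input dimension, hidden width, target
gamma, steps = 0.01, 5000                # step size and number of GD iterations
X = rng.uniform(-1.0, 1.0, (256, d))     # training inputs
y = np.full(256, c)                      # constant target values

W = rng.normal(0.0, 1.0, (width, d))     # hidden-layer weights
b = np.zeros(width)                      # hidden-layer biases
v = rng.normal(0.0, 1.0 / np.sqrt(width), width)  # output weights

def risk(W, b, v):
    h = np.maximum(X @ W.T + b, 0.0)     # ReLU (rectifier) activations
    return np.mean((h @ v - y) ** 2)     # empirical L2 risk

risk0 = risk(W, b, v)                    # risk at initialization
for _ in range(steps):
    h = np.maximum(X @ W.T + b, 0.0)
    r = h @ v - y                        # residuals
    mask = (h > 0).astype(float)         # derivative of the rectifier
    gW = 2.0 * (mask * np.outer(r, v)).T @ X / len(X)
    gb = 2.0 * np.mean(mask * np.outer(r, v), axis=0)
    gv = 2.0 * h.T @ r / len(X)
    W, b, v = W - gamma * gW, b - gamma * gb, v - gamma * gv  # vanilla GD step

print(risk0, risk(W, b, v))  # the final risk should be far below the initial one
```

Constant targets are the easiest regime to fit, which is one way to read the paper's result: this is the simplest setting in which convergence of the risk to zero can be established rigorously.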
Related papers
- A generalized neural tangent kernel for surrogate gradient learning [2.048226951354646]
We provide a generalization of the neural tangent kernel (NTK) that enables the analysis of surrogate gradient learning (SGL).
We numerically compare SGL in networks with sign activation function and finite width to kernel regression with the surrogate gradient NTK.
arXiv Detail & Related papers (2024-05-24T13:27:23Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis
Function Decomposition [4.01776052820812]
This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition.
We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis.
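A toy version of this projection idea can be sketched as follows (a quadratic objective of our own choosing, not the basis or learning tasks used in the paper): for gradient descent on f(x) = (1/2) x^T A x, projecting the iterates onto the orthonormal eigenbasis of A gives coefficients that each shrink monotonically, since c_i(t+1) = (1 - gamma * lambda_i) * c_i(t).

```python
import numpy as np

# Toy illustration (our own quadratic, not the paper's setting): project the
# gradient descent trajectory onto an orthonormal basis and observe that each
# coefficient shrinks monotonically, since c_i(t+1) = (1 - gamma*lam_i)*c_i(t).
rng = np.random.default_rng(1)
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]     # random orthonormal rotation
A = Q @ np.diag([3.0, 1.0, 0.25]) @ Q.T          # SPD matrix; objective 0.5*x^T A x
lam, U = np.linalg.eigh(A)                       # orthonormal eigenbasis U of A

gamma = 0.1                                      # satisfies gamma * max(lam) < 1
x = rng.normal(size=3)                           # initial point
coeffs = []
for _ in range(50):
    coeffs.append(U.T @ x)                       # projection onto the basis
    x = x - gamma * (A @ x)                      # plain gradient descent step

coeffs = np.abs(np.array(coeffs))                # |c_i(t)| along the trajectory
print(np.all(np.diff(coeffs, axis=0) <= 1e-12))  # prints: True (monotone decay)
```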
arXiv Detail & Related papers (2022-10-01T19:15:40Z) - Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - A proof of convergence for the gradient descent optimization method with
random initializations in the training of neural networks with ReLU
activation for piecewise linear target functions [3.198144010381572]
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-08-10T12:01:37Z) - q-RBFNN: A Quantum Calculus-based RBF Neural Network [31.14412266444568]
A gradient descent based learning approach for the radial basis function neural networks (RBFNN) is proposed.
The proposed method is based on the q-gradient which is also known as Jackson derivative.
The proposed $q$-RBFNN is analyzed for its convergence performance in the context of the least squares algorithm.
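The Jackson derivative mentioned above has a simple closed form, D_q f(x) = (f(qx) - f(x)) / ((q - 1) x), which recovers the ordinary derivative as q approaches 1. A minimal sketch (the function names are ours, not the paper's):

```python
def q_derivative(f, x, q=0.9):
    """Jackson (q-)derivative: D_q f(x) = (f(q*x) - f(x)) / ((q - 1) * x)."""
    if x == 0 or q == 1:
        raise ValueError("requires x != 0 and q != 1")
    return (f(q * x) - f(x)) / ((q - 1) * x)

# For f(x) = x^2 the q-derivative is exactly (q + 1) * x, approaching 2x as q -> 1:
print(q_derivative(lambda t: t * t, 2.0, q=0.9))    # 1.9 * 2 = 3.8
print(q_derivative(lambda t: t * t, 2.0, q=0.999))  # close to the classical 4.0
```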
arXiv Detail & Related papers (2021-06-02T08:27:12Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NL).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work studies the iteration complexity of zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient in terms of both iteration complexity and function query cost.
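The core ZO idea, estimating a gradient from function evaluations alone, can be sketched with a generic two-point estimator (our own illustration; the paper's coordinate importance sampling is a more refined design):

```python
import numpy as np

# Generic zeroth-order (ZO) gradient estimator (our own sketch, not the
# paper's method): approximate the gradient using only function queries,
# then run plain descent with the estimate.
rng = np.random.default_rng(2)
f = lambda x: float(np.sum(x ** 2))              # black-box objective

def zo_gradient(f, x, mu=1e-4, samples=32):
    g = np.zeros_like(x)
    for _ in range(samples):
        u = rng.normal(size=x.shape)             # random probe direction
        g += (f(x + mu * u) - f(x)) / mu * u     # two function queries per sample
    return g / samples

x = np.ones(5)
for _ in range(200):
    x = x - 0.05 * zo_gradient(f, x)             # descent with estimated gradient
print(f(x))  # far below the starting value f(ones) = 5.0
```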
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
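Iterate averaging itself is easy to sketch (a toy noisy quadratic of our own, not the paper's NTK regime): with a constant step size, the last SGD iterate keeps fluctuating around the optimum, while the running (Polyak-Ruppert) average settles much closer to it.

```python
import numpy as np

# Toy sketch of averaged SGD (our own example, not the paper's NTK analysis):
# constant-step-size SGD on f(x) = 0.5 * ||x||^2 with noisy gradients.
rng = np.random.default_rng(3)
gamma, steps = 0.1, 2000
x = np.ones(2)                                   # last iterate
avg = np.zeros(2)                                # running average of iterates
for t in range(1, steps + 1):
    g = x + rng.normal(size=2)                   # stochastic gradient: x + noise
    x = x - gamma * g                            # plain SGD step
    avg += (x - avg) / t                         # incremental Polyak-Ruppert average
print(np.linalg.norm(avg))  # the average sits much closer to the optimum at 0
```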
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave minmax problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.