A proof of convergence for gradient descent in the training of
artificial neural networks for constant target functions
- URL: http://arxiv.org/abs/2102.09924v1
- Date: Fri, 19 Feb 2021 13:33:03 GMT
- Title: A proof of convergence for gradient descent in the training of
artificial neural networks for constant target functions
- Authors: Patrick Cheridito, Arnulf Jentzen, Adrian Riekert, Florian Rossmannek
- Abstract summary: We show that the risk function of the gradient descent method does indeed converge to zero.
A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters.
- Score: 3.4792548480344254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient descent optimization algorithms are the standard ingredients that
are used to train artificial neural networks (ANNs). Even though a huge number
of numerical simulations indicate that gradient descent optimization methods do
indeed converge in the training of ANNs, to this day there is no rigorous
theoretical analysis which proves (or disproves) this conjecture. In
particular, even in the case of the most basic variant of gradient descent
optimization algorithms, the plain vanilla gradient descent method, it remains
an open problem to prove or disprove the conjecture that gradient descent
converges in the training of ANNs. In this article we solve this problem in the
special situation where the target function under consideration is a constant
function. More specifically, in the case of constant target functions we prove
in the training of rectified fully-connected feedforward ANNs with one hidden
layer that the risk function of the gradient descent method does indeed
converge to zero. Our mathematical analysis strongly exploits the property that
the rectifier function is the activation function used in the considered ANNs.
A key contribution of this work is to explicitly specify a Lyapunov function
for the gradient flow system of the ANN parameters. This Lyapunov function is
the central tool in our convergence proof of the gradient descent method.
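The convergence statement above can be illustrated numerically. The following is a minimal sketch in plain NumPy (a toy setup of our own, not the paper's framework or proof): vanilla gradient descent on a one-hidden-layer ReLU network fitting a constant target, with the empirical risk driven toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (ours, for illustration): fit the constant target f(x) = c
# with a one-hidden-layer ReLU network trained by plain gradient descent.
d, width, c = 1, 16, 2.0                 # input dimension, hidden width, target
gamma, steps = 0.01, 5000                # step size and number of GD iterations
X = rng.uniform(-1.0, 1.0, (256, d))     # training inputs
y = np.full(256, c)                      # constant target values

W = rng.normal(0.0, 1.0, (width, d))     # hidden-layer weights
b = np.zeros(width)                      # hidden-layer biases
v = rng.normal(0.0, 1.0 / np.sqrt(width), width)  # output weights

def risk(W, b, v):
    h = np.maximum(X @ W.T + b, 0.0)     # ReLU (rectifier) activations
    return np.mean((h @ v - y) ** 2)     # empirical L2 risk

risk0 = risk(W, b, v)                    # risk at initialization
for _ in range(steps):
    h = np.maximum(X @ W.T + b, 0.0)
    r = h @ v - y                        # residuals
    mask = (h > 0).astype(float)         # derivative of the rectifier
    gW = 2.0 * (mask * np.outer(r, v)).T @ X / len(X)
    gb = 2.0 * np.mean(mask * np.outer(r, v), axis=0)
    gv = 2.0 * h.T @ r / len(X)
    W, b, v = W - gamma * gW, b - gamma * gb, v - gamma * gv  # vanilla GD step

print(risk0, risk(W, b, v))  # the final risk should be far below the initial one
```

Constant targets are the easiest regime to fit, which is one way to read the paper's result: this is the simplest setting in which convergence of the risk to zero can be established rigorously.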
Related papers
- A generalized neural tangent kernel for surrogate gradient learning [2.048226951354646]
We provide a generalization of the neural tangent kernel (NTK) that enables the analysis of surrogate gradient learning (SGL).
We numerically compare SGL in networks with sign activation function and finite width to kernel regression with the surrogate gradient NTK.
arXiv Detail & Related papers (2024-05-24T13:27:23Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis
Function Decomposition [4.01776052820812]
This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition.
We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis.
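A toy version of this projection idea can be sketched as follows (a quadratic objective of our own choosing, not the basis or learning tasks used in the paper): for gradient descent on f(x) = (1/2) x^T A x, projecting the iterates onto the orthonormal eigenbasis of A gives coefficients that each shrink monotonically, since c_i(t+1) = (1 - gamma * lambda_i) * c_i(t).

```python
import numpy as np

# Toy illustration (our own quadratic, not the paper's setting): project the
# gradient descent trajectory onto an orthonormal basis and observe that each
# coefficient shrinks monotonically, since c_i(t+1) = (1 - gamma*lam_i)*c_i(t).
rng = np.random.default_rng(1)
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]     # random orthonormal rotation
A = Q @ np.diag([3.0, 1.0, 0.25]) @ Q.T          # SPD matrix; objective 0.5*x^T A x
lam, U = np.linalg.eigh(A)                       # orthonormal eigenbasis U of A

gamma = 0.1                                      # satisfies gamma * max(lam) < 1
x = rng.normal(size=3)                           # initial point
coeffs = []
for _ in range(50):
    coeffs.append(U.T @ x)                       # projection onto the basis
    x = x - gamma * (A @ x)                      # plain gradient descent step

coeffs = np.abs(np.array(coeffs))                # |c_i(t)| along the trajectory
print(np.all(np.diff(coeffs, axis=0) <= 1e-12))  # prints: True (monotone decay)
```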
arXiv Detail & Related papers (2022-10-01T19:15:40Z) - Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - A proof of convergence for the gradient descent optimization method with
random initializations in the training of neural networks with ReLU
activation for piecewise linear target functions [3.198144010381572]
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-08-10T12:01:37Z) - q-RBFNN: A Quantum Calculus-based RBF Neural Network [31.14412266444568]
A gradient descent based learning approach for the radial basis function neural networks (RBFNN) is proposed.
The proposed method is based on the q-gradient which is also known as Jackson derivative.
The proposed $q$-RBFNN is analyzed for its convergence performance in the context of the least squares algorithm.
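The Jackson derivative mentioned above has a simple closed form, D_q f(x) = (f(qx) - f(x)) / ((q - 1) x), which recovers the ordinary derivative as q approaches 1. A minimal sketch (the function names are ours, not the paper's):

```python
def q_derivative(f, x, q=0.9):
    """Jackson (q-)derivative: D_q f(x) = (f(q*x) - f(x)) / ((q - 1) * x)."""
    if x == 0 or q == 1:
        raise ValueError("requires x != 0 and q != 1")
    return (f(q * x) - f(x)) / ((q - 1) * x)

# For f(x) = x^2 the q-derivative is exactly (q + 1) * x, approaching 2x as q -> 1:
print(q_derivative(lambda t: t * t, 2.0, q=0.9))    # 1.9 * 2 = 3.8
print(q_derivative(lambda t: t * t, 2.0, q=0.999))  # close to the classical 4.0
```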
arXiv Detail & Related papers (2021-06-02T08:27:12Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NL).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work studies the iteration complexity of zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient in terms of both iteration complexity and function query cost.
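The core ZO idea, estimating a gradient from function evaluations alone, can be sketched with a generic two-point estimator (our own illustration; the paper's coordinate importance sampling is a more refined design):

```python
import numpy as np

# Generic zeroth-order (ZO) gradient estimator (our own sketch, not the
# paper's method): approximate the gradient using only function queries,
# then run plain descent with the estimate.
rng = np.random.default_rng(2)
f = lambda x: float(np.sum(x ** 2))              # black-box objective

def zo_gradient(f, x, mu=1e-4, samples=32):
    g = np.zeros_like(x)
    for _ in range(samples):
        u = rng.normal(size=x.shape)             # random probe direction
        g += (f(x + mu * u) - f(x)) / mu * u     # two function queries per sample
    return g / samples

x = np.ones(5)
for _ in range(200):
    x = x - 0.05 * zo_gradient(f, x)             # descent with estimated gradient
print(f(x))  # far below the starting value f(ones) = 5.0
```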
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
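Iterate averaging itself is easy to sketch (a toy noisy quadratic of our own, not the paper's NTK regime): with a constant step size, the last SGD iterate keeps fluctuating around the optimum, while the running (Polyak-Ruppert) average settles much closer to it.

```python
import numpy as np

# Toy sketch of averaged SGD (our own example, not the paper's NTK analysis):
# constant-step-size SGD on f(x) = 0.5 * ||x||^2 with noisy gradients.
rng = np.random.default_rng(3)
gamma, steps = 0.1, 2000
x = np.ones(2)                                   # last iterate
avg = np.zeros(2)                                # running average of iterates
for t in range(1, steps + 1):
    g = x + rng.normal(size=2)                   # stochastic gradient: x + noise
    x = x - gamma * g                            # plain SGD step
    avg += (x - avg) / t                         # incremental Polyak-Ruppert average
print(np.linalg.norm(avg))  # the average sits much closer to the optimum at 0
```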
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave minmax problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.