Mathematical analysis of the gradients in deep learning
- URL: http://arxiv.org/abs/2501.15646v1
- Date: Sun, 26 Jan 2025 19:11:57 GMT
- Title: Mathematical analysis of the gradients in deep learning
- Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Frederic Weber
- Abstract summary: We show that the generalized gradient function must coincide with the standard gradient of the cost functional on every open set on which the cost functional is continuously differentiable.
- Score: 3.3123773366516645
- License:
- Abstract: Deep learning algorithms -- typically consisting of a class of deep artificial neural networks (ANNs) trained by a stochastic gradient descent (SGD) optimization method -- are nowadays an integral part of many areas of science, industry, and also our day-to-day life. Roughly speaking, in their most basic form, ANNs can be regarded as functions that consist of a series of compositions of affine-linear functions with multidimensional versions of so-called activation functions. One of the most popular such activation functions is the rectified linear unit (ReLU) function $\mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R}$. The ReLU function is, however, not differentiable and, typically, this lack of regularity transfers to the cost function of the supervised learning problem under consideration. Regardless of this lack of differentiability, deep learning practitioners apply SGD methods based on suitably generalized gradients in standard deep learning libraries like {\sc TensorFlow} or {\sc PyTorch}. In this work we reveal an accurate and concise mathematical description of such generalized gradients in the training of deep fully-connected feedforward ANNs and we also study the resulting generalized gradient function analytically. Specifically, we provide an appropriate approximation procedure that uniquely describes the generalized gradient function, we prove that the generalized gradients are limiting Fr\'echet subgradients of the cost functional, and we conclude that the generalized gradients must coincide with the standard gradient of the cost functional on every open set on which the cost functional is continuously differentiable.
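As a concrete illustration of how such generalized gradients appear in practice (not an excerpt from the paper), the following minimal PyTorch sketch queries the derivative that autograd assigns to the ReLU function, including at the non-differentiable point $x = 0$, where PyTorch returns the conventional value 0.

```python
import torch

# ReLU: x -> max(x, 0), non-differentiable at x = 0.
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
torch.relu(x).sum().backward()

# The generalized gradient used in practice: 0 for x < 0, 1 for x > 0,
# and a fixed convention (0 in PyTorch) at the kink x = 0.
print(x.grad)  # tensor([0., 0., 1.])
```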
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold [59.73080197971106]
This paper presents a first-order conjugate optimization method that converges faster than the steepest descent method.
It aims to achieve global convergence over the Stiefel manifold.
arXiv Detail & Related papers (2023-08-21T08:02:16Z) - Continuous Function Structured in Multilayer Perceptron for Global Optimization [0.0]
The gradient information of a multilayer perceptron with a linear neuron is modified with a functional derivative for benchmarking global minimum search problems.
We show that the landscape of the gradient derived from a given continuous function using the functional derivative can take the form of $ax+b$ neurons.
arXiv Detail & Related papers (2023-03-07T14:50:50Z) - Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis Function Decomposition [4.01776052820812]
This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition.
We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis.
arXiv Detail & Related papers (2022-10-01T19:15:40Z) - Learning Globally Smooth Functions on Manifolds [94.22412028413102]
Learning smooth functions is generally challenging, except in simple cases such as learning linear or kernel models.
This work proposes to overcome these obstacles by combining techniques from semi-infinite constrained learning and manifold regularization.
We prove that, under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct.
arXiv Detail & Related papers (2022-10-01T15:45:35Z) - Riemannian Stochastic Gradient Method for Nested Composition Optimization [0.0]
This work considers optimization of compositions of functions in a nested form over Riemannian manifolds, where each function contains an expectation.
This type of problem is gaining popularity in applications such as policy evaluation in reinforcement learning or model customization in meta-learning.
arXiv Detail & Related papers (2022-07-19T15:58:27Z) - Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation; a minimal training sketch for this setting is given after this list.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions [3.4792548480344254]
We show that the risk function of the gradient descent method does indeed converge to zero.
A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters.
arXiv Detail & Related papers (2021-02-19T13:33:03Z) - On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z) - Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze an Optimistic Adagrad-type algorithm for nonconvex-nonconcave minimax problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
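For the shared setting of the main paper and the ReLU convergence-proof entry referenced above, here is a minimal, hypothetical PyTorch sketch of SGD training of a fully-connected feedforward ReLU network on a constant target function; the architecture, width, learning rate, and data are illustrative assumptions and are not taken from any of the papers.

```python
import torch
import torch.nn as nn

# Illustrative toy setup: a fully-connected feedforward ReLU network trained
# by plain SGD on the constant target function f(x) = 1.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(200):
    x = torch.rand(32, 2)       # random inputs in [0, 1]^2
    target = torch.ones(32, 1)  # constant target function
    opt.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()             # backprop uses the generalized ReLU gradient
    opt.step()

print(f"final training loss: {loss.item():.4f}")
```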
This list is automatically generated from the titles and abstracts of the papers in this site.