Related papers: Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

URL: http://arxiv.org/abs/2410.10533v1
Date: Mon, 14 Oct 2024 14:11:37 GMT
Title: Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation
Authors: Sonja Hannibal, Arnulf Jentzen, Do Minh Thang
Abstract summary: It remains an open problem of research to explain the success and the limitations of SGD methods in rigorous theoretical terms. In this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods.
Score: 3.6185342807265415
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. It turns out that the probability to not converge to a global minimizer converges at least exponentially quickly to one as the width of the first hidden layer of the ANN and the depth of the ANN, respectively, increase. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods such as the momentum SGD, the Nesterov accelerated SGD, the Adagrad, the RMSProp, the Adam, the Adamax, the AMSGrad, and the Nadam optimizers.

Related papers

Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks [5.052293146674794]
It remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value.
arXiv Detail & Related papers (2025-03-03T15:36:01Z)
GDSG: Graph Diffusion-based Solution Generator for Optimization Problems in MEC Networks [109.17835015018532]
We present a Graph Diffusion-based Solution Generation (GDSG) method. This approach is designed to work with suboptimal datasets while converging to the optimal solution with high probability. We build GDSG as a multi-task diffusion model utilizing a Graph Neural Network (GNN) to acquire the distribution of high-quality solutions.
arXiv Detail & Related papers (2024-12-11T11:13:43Z)
On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community. Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z)
The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show that it generalizes well under suitable high-dimensional settings. This paper investigates SGD with two components that are essential in practice: an exponentially decaying step size schedule and momentum.
arXiv Detail & Related papers (2024-09-15T14:20:03Z)
Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates [3.6185342807265415]
Deep learning algorithms are the key ingredients in many artificial intelligence (AI) systems. Deep learning algorithms typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
arXiv Detail & Related papers (2024-07-11T00:10:35Z)
Non-convergence to global minimizers for Adam and stochastic gradient descent optimization and constructions of local minimizers in the training of artificial neural networks [6.708125191843434]
It remains an open problem to rigorously explain why SGD methods seem to succeed in training ANNs. We disprove that SGD methods can find a global minimizer with high probability. Even stronger, we reveal in the training of such ANNs that SGD methods do with high probability fail to converge to global minimizers.
arXiv Detail & Related papers (2024-02-07T16:14:04Z)
Achieving Constraints in Neural Networks: A Stochastic Augmented Lagrangian Approach [49.1574468325115]
Regularizing Deep Neural Networks (DNNs) is essential for improving generalizability and preventing overfitting. We propose a novel approach to DNN regularization by framing the training process as a constrained optimization problem. We employ the Stochastic Augmented Lagrangian (SAL) method to achieve a more flexible and efficient regularization mechanism.
arXiv Detail & Related papers (2023-10-25T13:55:35Z)
Membrane Potential Distribution Adjustment and Parametric Surrogate Gradient in Spiking Neural Networks [3.485537704990941]
The surrogate gradient (SG) strategy is investigated and applied to circumvent this issue and train SNNs from scratch. We propose the parametric surrogate gradient (PSG) method to iteratively update SG and eventually determine an optimal surrogate gradient parameter. Experimental results demonstrate that the proposed methods can be readily integrated with the backpropagation through time (BPTT) algorithm.
arXiv Detail & Related papers (2023-04-26T05:02:41Z)
Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems. PINNs are prone to training failures when the target functions to be approximated exhibit high-frequency or multi-scale features. In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF). We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF. We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs). In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z)
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization. We show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.