Non-asymptotic estimates for TUSLA algorithm for non-convex learning
with applications to neural networks with ReLU activation function
- URL: http://arxiv.org/abs/2107.08649v2
- Date: Tue, 2 May 2023 15:45:00 GMT
- Title: Non-asymptotic estimates for TUSLA algorithm for non-convex learning
with applications to neural networks with ReLU activation function
- Authors: Dong-Young Lim, Ariel Neufeld, Sotirios Sabanis, Ying Zhang
- Abstract summary: We provide a non-asymptotic analysis for the tamed un-adjusted Langevin algorithm (TUSLA) introduced in Lovas et al.
In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wassersteinstein-1-2.
We show that the TUSLA algorithm converges rapidly to the optimal solution.
- Score: 3.5044892799305956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider non-convex stochastic optimization problems where the objective
functions have super-linearly growing and discontinuous stochastic gradients.
In such a setting, we provide a non-asymptotic analysis for the tamed
unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al.
(2020). In particular, we establish non-asymptotic error bounds for the TUSLA
algorithm in Wasserstein-1 and Wasserstein-2 distances. The latter result
enables us to further derive non-asymptotic estimates for the expected excess
risk. To illustrate the applicability of the main results, we consider an
example from transfer learning with ReLU neural networks, which represents a
key paradigm in machine learning. Numerical experiments are presented for the
aforementioned example which support our theoretical findings. Hence, in this
setting, we demonstrate both theoretically and numerically that the TUSLA
algorithm can solve the optimization problem involving neural networks with
ReLU activation function. Besides, we provide simulation results for synthetic
examples where popular algorithms, e.g. ADAM, AMSGrad, RMSProp, and (vanilla)
stochastic gradient descent (SGD) algorithm, may fail to find the minimizer of
the objective functions due to the super-linear growth and the discontinuity of
the corresponding stochastic gradient, while the TUSLA algorithm converges
rapidly to the optimal solution. Moreover, we provide an empirical comparison
of the performance of TUSLA with popular stochastic optimizers on real-world
datasets, as well as investigate the effect of the key hyperparameters of TUSLA
on its performance.
Related papers
- Non-asymptotic convergence analysis of the stochastic gradient
Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with
applications to training of ReLU neural networks [8.058385158111207]
We provide a non-asymptotic analysis of the convergence of the gradient Hamiltonian Monte Carlo to a target measure in Wasserstein-1 and Wasserstein-2 distance.
To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
arXiv Detail & Related papers (2024-09-25T17:21:09Z) - Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges a globally optimal solution at a linear convergence rate.
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ implicit gradient descent (ISGD) method to train PINNs for improving the stability of training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradient [6.563379950720334]
We introduce a new Langevin dynamics based algorithm, called e-TH$varepsilon$O POULA, to solve optimization problems with discontinuous gradients.
Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction.
arXiv Detail & Related papers (2022-10-24T13:10:06Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded on the complexity' of the fractal structure that underlies its generalization measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden/layered neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Practical Precoding via Asynchronous Stochastic Successive Convex
Approximation [8.808993671472349]
We consider optimization of a smooth non-studied loss function with a convex non-smooth regularizer.
In this work, we take a closer look at the SCA algorithm and develop its asynchronous variant for resource allocation in wireless networks.
arXiv Detail & Related papers (2020-10-03T13:53:30Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel, referred to as Recursive One-Over-T SGD, based on an easily implementable, averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Taming neural networks with TUSLA: Non-convex learning via adaptive
stochastic gradient Langevin algorithms [0.0]
We offer on an appropriately constructed gradient algorithm based on problematic Lange dynamics (SGLD)
We also provide nonasymptotic analysis of the use of new algorithm's convergence properties.
The roots of the TUSLA algorithm are based on the taming processes with developed coefficients citettamed-euler.
arXiv Detail & Related papers (2020-06-25T16:06:22Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.