Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte
Carlo
- URL: http://arxiv.org/abs/2305.19350v1
- Date: Tue, 30 May 2023 18:25:11 GMT
- Title: Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte
Carlo
- Authors: Wei Deng
- Abstract summary: The rise of artificial intelligence (AI) hinges on efficient of modern deep neural networks (DNNs) for non-trips and uncertainty.
In this thesis we propose a tool to handle the problem of Monte Carlo exploitation.
We also propose two dynamic importance sampling algorithms for the underlying ordinary equation (ODE) system.
- Score: 4.656426393230839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of artificial intelligence (AI) hinges on the efficient training of
modern deep neural networks (DNNs) for non-convex optimization and uncertainty
quantification, which boils down to a non-convex Bayesian learning problem. A
standard tool to handle the problem is Langevin Monte Carlo, which proposes to
approximate the posterior distribution with theoretical guarantees. In this
thesis, we start with the replica exchange Langevin Monte Carlo (also known as
parallel tempering), which proposes appropriate swaps between exploration and
exploitation to achieve accelerations. However, the na\"ive extension of swaps
to big data problems leads to a large bias, and bias-corrected swaps are
required. Such a mechanism leads to few effective swaps and insignificant
accelerations. To alleviate this issue, we first propose a control variates
method to reduce the variance of noisy energy estimators and show a potential
to accelerate the exponential convergence. We also present the population-chain
replica exchange based on non-reversibility and obtain an optimal round-trip
rate for deep learning. In the second part of the thesis, we study scalable
dynamic importance sampling algorithms based on stochastic approximation.
Traditional dynamic importance sampling algorithms have achieved success,
however, the lack of scalability has greatly limited their extensions to big
data. To handle this scalability issue, we resolve the vanishing gradient
problem and propose two dynamic importance sampling algorithms. Theoretically,
we establish the stability condition for the underlying ordinary differential
equation (ODE) system and guarantee the asymptotic convergence of the latent
variable to the desired fixed point. Interestingly, such a result still holds
given non-convex energy landscapes.
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO)
We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - HERTA: A High-Efficiency and Rigorous Training Algorithm for Unfolded Graph Neural Networks [14.139047596566485]
HERTA is a high-efficiency and rigorous training algorithm for Unfolded GNNs.
HERTA converges to the optimum of the original model, thus preserving the interpretability of Unfolded GNNs.
As a byproduct of HERTA, we propose a new spectral sparsification method applicable to normalized and regularized graph Laplacians.
arXiv Detail & Related papers (2024-03-26T23:03:06Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Stochastic Methods in Variational Inequalities: Ergodicity, Bias and
Refinements [19.524063429548278]
Extragradient (SEG) and Gradient Descent Ascent (SGDA) are preeminent algorithms for min-max optimization and variational inequalities problems.
Our work endeavors to quantify and quantify the intrinsic structures intrinsic to these algorithms.
By recasting the constant step-size SEG/SGDA as time-homogeneous Markov Chains, we establish a first-of-its-kind Law of Large Numbers and a Central Limit Theorem.
arXiv Detail & Related papers (2023-06-28T18:50:07Z) - An Optimization-based Deep Equilibrium Model for Hyperspectral Image
Deconvolution with Convergence Guarantees [71.57324258813675]
We propose a novel methodology for addressing the hyperspectral image deconvolution problem.
A new optimization problem is formulated, leveraging a learnable regularizer in the form of a neural network.
The derived iterative solver is then expressed as a fixed-point calculation problem within the Deep Equilibrium framework.
arXiv Detail & Related papers (2023-06-10T08:25:16Z) - Can Decentralized Stochastic Minimax Optimization Algorithms Converge
Linearly for Finite-Sum Nonconvex-Nonconcave Problems? [56.62372517641597]
Decentralized minimax optimization has been actively studied in the past few years due to its application in a wide range machine learning.
This paper develops two novel decentralized minimax optimization algorithms for the non-strongly-nonconcave problem.
arXiv Detail & Related papers (2023-04-24T02:19:39Z) - High-dimensional limit theorems for SGD: Effective dynamics and critical
scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of gradient descent (SGD)
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.