Pathwise Gradient Variance Reduction with Control Variates in Variational Inference
- URL: http://arxiv.org/abs/2410.05753v1
- Date: Tue, 8 Oct 2024 07:28:46 GMT
- Title: Pathwise Gradient Variance Reduction with Control Variates in Variational Inference
- Authors: Kenyon Ng, Susan Wei
- Abstract summary: Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution.
In these cases, pathwise and score-function gradient estimators are the most common approaches.
Recent research suggests that even pathwise gradient estimators could benefit from variance reduction.
- Score: 2.1638817206926855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution. In these cases, pathwise and score-function gradient estimators are the most common approaches. The pathwise estimator is often favoured for its substantially lower variance compared to the score-function estimator, which typically requires variance reduction techniques. However, recent research suggests that even pathwise gradient estimators could benefit from variance reduction. In this work, we review existing control-variates-based variance reduction methods for pathwise gradient estimators to assess their effectiveness. Notably, these methods often rely on integrand approximations and are applicable only to simple variational families. To address this limitation, we propose applying zero-variance control variates to pathwise gradient estimators. This approach offers the advantage of requiring minimal assumptions about the variational distribution, other than being able to sample from it.
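The contrast the abstract draws between the two estimators, and the effect of a control variate on the pathwise one, can be illustrated on a toy problem. The sketch below is not the authors' method or code. It assumes a one-dimensional Gaussian variational distribution q = N(mu, sigma^2), a hypothetical integrand f(z) = z^2 + sin(z), and a simple linear control variate in the base noise eps, which has known mean zero. All function names and parameter values are illustrative assumptions.

```python
import numpy as np


def f(z):
    # Hypothetical integrand, chosen only so the exact gradient is known.
    return z**2 + np.sin(z)


def df(z):
    # Derivative of f with respect to z.
    return 2.0 * z + np.cos(z)


def gradient_samples(mu=1.0, sigma=1.5, n=100_000, seed=0):
    """Per-sample estimates of d/dmu E_{z ~ N(mu, sigma^2)}[f(z)]."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    z = mu + sigma * eps  # reparameterisation: z is a function of mu and eps

    # Score-function estimator: f(z) * d/dmu log q(z).
    score_fn = f(z) * (z - mu) / sigma**2

    # Pathwise estimator: differentiate through the sample, dz/dmu = 1.
    pathwise = df(z)

    # Linear control variate: eps has mean zero, so subtracting beta * eps
    # leaves the expectation unchanged. beta is the least-squares coefficient.
    # Estimating beta from the same samples adds a small O(1/n) bias.
    beta = np.cov(pathwise, eps)[0, 1] / np.var(eps, ddof=1)
    pathwise_cv = pathwise - beta * eps

    return score_fn, pathwise, pathwise_cv


def exact_gradient(mu=1.0, sigma=1.5):
    # E[2z + cos(z)] for z ~ N(mu, sigma^2).
    return 2.0 * mu + np.cos(mu) * np.exp(-0.5 * sigma**2)


if __name__ == "__main__":
    score_fn, pathwise, pathwise_cv = gradient_samples()
    print("exact gradient:", exact_gradient())
    for name, g in [("score-function", score_fn),
                    ("pathwise", pathwise),
                    ("pathwise + CV", pathwise_cv)]:
        print(f"{name:>15}: mean={g.mean():.4f}  var={g.var():.4f}")
```

In this toy setting all three estimators target the same gradient, and the per-sample variance should fall from the score-function estimator to the pathwise estimator and again once the control variate is applied. The control variate here depends on the Gaussian base noise, so it is only a stand-in for the more general construction the abstract describes.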
Related papers
- Gradients should stay on Path: Better Estimators of the Reverse- and Forward KL Divergence for Normalizing Flows [4.830811539001643]
We propose an algorithm to estimate the path-gradient of both the reverse and forward Kullback-Leibler divergence for an arbitrary manifestly invertible normalizing flow.
The resulting path-gradient estimators are straightforward to implement, have lower variance, and lead not only to faster convergence of training but also to better overall approximation results.
arXiv Detail & Related papers (2022-07-17T16:27:41Z) - Path-Gradient Estimators for Continuous Normalizing Flows [4.830811539001643]
Recent work has established a path-gradient estimator for simple variational Gaussian distributions.
We propose a path-gradient estimator for the considerably more expressive variational family of continuous normalizing flows.
arXiv Detail & Related papers (2022-06-17T21:25:06Z) - Gradient Estimation with Discrete Stein Operators [44.64146470394269]
We introduce a variance reduction technique based on Stein operators for discrete distributions.
Our technique achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
arXiv Detail & Related papers (2022-02-19T02:22:23Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Coordinate-wise Control Variates for Deep Policy Gradients [23.24910014825916]
The effect of vector-valued baselines for neural net policies is under-explored.
We show that lower variance can be obtained with such baselines than with the conventional scalar-valued baseline.
arXiv Detail & Related papers (2021-07-11T07:36:01Z) - VarGrad: A Low-Variance Gradient Estimator for Variational Inference [9.108412698936105]
We show that VarGrad offers a favourable variance versus computation trade-off compared to other state-of-the-art estimators on a discrete VAE.
arXiv Detail & Related papers (2020-10-20T16:46:01Z) - Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient
Estimator [93.05919133288161]
We show that the variance of the straight-through variant of the popular Gumbel-Softmax estimator can be reduced through Rao-Blackwellization.
This provably reduces the mean squared error.
We empirically demonstrate that this leads to variance reduction, faster convergence, and generally improved performance in two unsupervised latent variable models.
arXiv Detail & Related papers (2020-10-09T22:54:38Z) - A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z) - A One-step Approach to Covariate Shift Adaptation [82.01909503235385]
A default assumption in many machine learning scenarios is that the training and test samples are drawn from the same probability distribution.
We propose a novel one-step approach that jointly learns the predictive model and the associated weights in one optimization.
arXiv Detail & Related papers (2020-07-08T11:35:47Z) - Scalable Control Variates for Monte Carlo Methods via Stochastic
Optimization [62.47170258504037]
This paper presents a framework that encompasses and generalizes existing approaches that use controls, kernels and neural networks.
Novel theoretical results are presented to provide insight into the variance reduction that can be achieved, and an empirical assessment, including applications to Bayesian inference, is provided in support.
arXiv Detail & Related papers (2020-06-12T22:03:25Z) - Estimating Gradients for Discrete Random Variables by Sampling without
Replacement [93.09326095997336]
We derive an unbiased estimator for expectations over discrete random variables based on sampling without replacement.
We show that our estimator can be derived as the Rao-Blackwellization of three different estimators.
arXiv Detail & Related papers (2020-02-14T14:15:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.