Gradients should stay on Path: Better Estimators of the Reverse- and
Forward KL Divergence for Normalizing Flows
- URL: http://arxiv.org/abs/2207.08219v1
- Date: Sun, 17 Jul 2022 16:27:41 GMT
- Title: Gradients should stay on Path: Better Estimators of the Reverse- and
Forward KL Divergence for Normalizing Flows
- Authors: Lorenz Vaitl, Kim A. Nicoli, Shinichi Nakajima, Pan Kessel
- Abstract summary: We propose an algorithm to estimate the path-gradient of both the reverse and forward Kullback-Leibler divergence for an arbitrary manifestly invertible normalizing flow.
The resulting path-gradient estimators are straightforward to implement, have lower variance, and lead not only to faster convergence of training but also to better overall approximation results.
- Score: 4.830811539001643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an algorithm to estimate the path-gradient of both the reverse and
forward Kullback-Leibler divergence for an arbitrary manifestly invertible
normalizing flow. The resulting path-gradient estimators are straightforward to
implement, have lower variance, and lead not only to faster convergence of
training but also to better overall approximation results compared to standard
total gradient estimators. We also demonstrate that path-gradient training is
less susceptible to mode-collapse. In light of our results, we expect that
path-gradient estimators will become the new standard method to train
normalizing flows for variational inference.
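As a rough illustration of what a path-gradient estimator looks like in practice, the sketch below implements a "sticking the landing"-style path gradient of the reverse KL for a toy one-layer affine flow in PyTorch. This is a minimal sketch under illustrative assumptions (the target log_p, the parameters mu and log_sigma, and the sample size are all made up for the example); it is not the paper's algorithm for arbitrary manifestly invertible flows, only the basic idea for the simplest possible flow.

```python
import torch
from torch.distributions import Normal

# Toy setup (illustrative assumptions): a single affine flow x = mu + exp(log_sigma) * eps
# with standard-normal base samples eps, and a stand-in target density log_p.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)

def log_p(x):
    # Unnormalized target: a standard normal shifted to mean 3 (placeholder target).
    return Normal(3.0, 1.0).log_prob(x).sum(-1)

def reverse_kl_path_gradient_loss(n_samples=1024):
    eps = torch.randn(n_samples, 2)
    x = mu + log_sigma.exp() * eps                # reparameterized sample; keeps d x / d theta

    # Path-gradient ("sticking the landing"-style) trick: evaluate log q_theta(x) with
    # *detached* parameters, so the only remaining theta-dependence is the path through x.
    # Backpropagating through this loss then yields the path gradient, not the total gradient.
    q_detached = Normal(mu.detach(), log_sigma.detach().exp())
    log_q = q_detached.log_prob(x).sum(-1)

    return (log_q - log_p(x)).mean()              # Monte Carlo reverse KL, up to a constant

loss = reverse_kl_path_gradient_loss()
loss.backward()                                   # mu.grad and log_sigma.grad hold path gradients
print(mu.grad, log_sigma.grad)
```

The only change relative to the standard total-gradient estimator (which would evaluate the flow's own log-density with the live parameters) is the detach on those parameters: it removes the score term, whose expectation is zero but whose variance is not, which is where the reported variance reduction comes from.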
Related papers
- On Divergence Measures for Training GFlowNets [3.7277730514654555]
Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects.
Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution.
We review four divergence measures, namely Renyi-$\alpha$, Tsallis-$\alpha$, and the reverse and forward KL divergences, and design statistically efficient estimators for their gradients in the context of training GFlowNets.
arXiv Detail & Related papers (2024-10-12T03:46:52Z)
- Pathwise Gradient Variance Reduction with Control Variates in Variational Inference [2.1638817206926855]
Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution.
In these cases, pathwise and score-function gradient estimators are the most common approaches; a minimal toy comparison of the two appears after this list.
Recent research suggests that even pathwise gradient estimators could benefit from variance reduction.
arXiv Detail & Related papers (2024-10-08T07:28:46Z)
- Fast and Unified Path Gradient Estimators for Normalizing Flows [5.64979077798699]
Path gradient estimators for normalizing flows have lower variance than standard estimators for variational inference.
We propose a fast path gradient estimator which improves computational efficiency significantly.
We empirically establish its superior performance and reduced variance for several natural sciences applications.
arXiv Detail & Related papers (2024-03-23T16:21:22Z)
- Can Forward Gradient Match Backpropagation? [2.875726839945885]
Forward gradients have been shown to be usable for training neural networks.
We propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks.
We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.
arXiv Detail & Related papers (2023-06-12T08:53:41Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Path-Gradient Estimators for Continuous Normalizing Flows [4.830811539001643]
Recent work has established a path-gradient estimator for simple variational Gaussian distributions.
We propose a path-gradient estimator for the considerably more expressive variational family of continuous normalizing flows.
arXiv Detail & Related papers (2022-06-17T21:25:06Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
- A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of the average mini-batch gradient via stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate more stable and better-performing training of deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
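For the pathwise versus score-function distinction mentioned in the "Pathwise Gradient Variance Reduction with Control Variates in Variational Inference" entry above, the following toy comparison (an illustrative assumption, not taken from any of the listed papers) estimates the same gradient with both estimators.

```python
import torch
from torch.distributions import Normal

# Toy comparison (illustrative only): two Monte Carlo estimators of
# d/d_theta E_{x ~ N(theta, 1)}[x^2], whose exact value is 2 * theta.
theta = torch.tensor(1.5, requires_grad=True)
n = 10_000

# Score-function (REINFORCE) estimator: mean of f(x) * d log q_theta(x) / d_theta.
x = Normal(theta, 1.0).sample((n,))                 # samples carry no gradient
score_surrogate = (x ** 2 * Normal(theta, 1.0).log_prob(x)).mean()
score_grad, = torch.autograd.grad(score_surrogate, theta)

# Pathwise (reparameterization) estimator: differentiate f through x = theta + eps.
eps = torch.randn(n)
path_surrogate = ((theta + eps) ** 2).mean()
path_grad, = torch.autograd.grad(path_surrogate, theta)

print(f"exact: {2 * 1.5:.3f}  score-function: {score_grad.item():.3f}  pathwise: {path_grad.item():.3f}")
```

Running this a few times makes the variance difference visible: the score-function estimate fluctuates noticeably around the exact value 2*theta = 3, while the pathwise estimate stays close to it.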