Bures-Wasserstein Importance-Weighted Evidence Lower Bound: Exposition and Applications
- URL: http://arxiv.org/abs/2602.04272v1
- Date: Wed, 04 Feb 2026 07:01:56 GMT
- Title: Bures-Wasserstein Importance-Weighted Evidence Lower Bound: Exposition and Applications
- Authors: Peiwen Jiang, Takuo Matsubara, Minh-Ngoc Tran
- Abstract summary: The Importance-Weighted Evidence Lower Bound (IW-ELBO) has emerged as an effective objective for variational inference (VI). This paper formulates the optimisation of the IW-ELBO in Bures-Wasserstein space. A pivotal contribution of our analysis concerns the stability of the gradient estimator.
- Score: 10.150648641677828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Importance-Weighted Evidence Lower Bound (IW-ELBO) has emerged as an effective objective for variational inference (VI), tightening the standard ELBO and mitigating mode-seeking behaviour. However, optimizing the IW-ELBO in Euclidean space is often inefficient, as its gradient estimators suffer from a vanishing signal-to-noise ratio (SNR). This paper formulates the optimisation of the IW-ELBO in Bures-Wasserstein space, a manifold of Gaussian distributions equipped with the 2-Wasserstein metric. We derive the Wasserstein gradient of the IW-ELBO and project it onto the Bures-Wasserstein space to yield a tractable algorithm for Gaussian VI. A pivotal contribution of our analysis concerns the stability of the gradient estimator. While the SNR of the standard Euclidean gradient estimator is known to vanish as the number of importance samples $K$ increases, we prove that the SNR of the Wasserstein gradient scales favourably as $\Omega(\sqrt{K})$, ensuring optimisation efficiency even for large $K$. We further extend this geometric analysis to the Variational Rényi Importance-Weighted Autoencoder bound, establishing analogous stability guarantees. Experiments demonstrate that the proposed framework achieves superior approximation performance compared to baseline methods.
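For concreteness, the sketch below shows how a $K$-sample IW-ELBO estimate can be computed for a diagonal-Gaussian variational family with reparameterized samples. The toy target and all hyperparameters are illustrative assumptions; the paper's actual contribution is to optimize this objective over the Bures-Wasserstein manifold rather than by Euclidean gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(z):
    # Toy unnormalized log-target: a standard Gaussian (illustrative).
    return -0.5 * np.sum(z ** 2, axis=-1)

def log_q(z, mu, log_sigma):
    # Log-density of the diagonal-Gaussian variational family.
    var = np.exp(2.0 * log_sigma)
    return np.sum(-0.5 * (z - mu) ** 2 / var - log_sigma
                  - 0.5 * np.log(2.0 * np.pi), axis=-1)

def iw_elbo(mu, log_sigma, K=32):
    """Monte Carlo IW-ELBO: the log of the average of K importance
    weights p(z_k)/q(z_k), with z_k drawn from q by reparameterization."""
    eps = rng.standard_normal((K, mu.size))
    z = mu + np.exp(log_sigma) * eps          # reparameterized samples
    log_w = log_target(z) - log_q(z, mu, log_sigma)
    return np.logaddexp.reduce(log_w) - np.log(K)

print(iw_elbo(np.zeros(2), np.zeros(2)))      # tightens toward log Z as K grows
```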
Related papers
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
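As an illustration only, the sketch below shows a GRPO-style group-relative advantage and one plausible way to average such advantages across several diffusion scales; the function names and the averaging rule are assumptions, not the paper's Multi-Granularity Advantage Integration module.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def multi_granularity_advantage(rewards_per_scale):
    """Illustrative aggregation: average group-relative advantages
    computed independently at several scales (equal group sizes)."""
    advs = [group_relative_advantage(r) for r in rewards_per_scale]
    return np.mean(advs, axis=0)
```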
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Wasserstein Adaptive Value Estimation for Actor-Critic Reinforcement Learning [3.686808512438363]
We present Wasserstein Adaptive Value Estimation for Actor-Critic (WAVE). WAVE addresses the inherent instability of actor-critic algorithms by incorporating an adaptively weighted Wasserstein regularization term into the critic's loss function. We prove that WAVE achieves an $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate for the critic's mean squared error and provide theoretical guarantees for stability through Wasserstein-based regularization.
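A minimal sketch of how an adaptively weighted Wasserstein penalty could enter a critic loss, assuming equal-size value samples and a 1-D empirical 2-Wasserstein distance computed by sorting; the names and the weighting scheme are illustrative, not the paper's implementation.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Empirical 2-Wasserstein distance between two equal-size 1-D
    samples, via the sorted-coupling formula."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def critic_loss(pred_values, td_targets, prev_values, weight):
    """TD error plus a weighted Wasserstein penalty that discourages
    abrupt shifts in the critic's value distribution."""
    mse = np.mean((pred_values - td_targets) ** 2)
    penalty = wasserstein_1d(pred_values, prev_values) ** 2
    return mse + weight * penalty
```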
arXiv Detail & Related papers (2025-01-17T23:37:21Z)
- Provable Complexity Improvement of AdaGrad over SGD: Upper and Lower Bounds in Stochastic Non-Convex Optimization [18.47705532817026]
Adaptive gradient methods are among the most successful neural network training algorithms. These methods are known to achieve a better dimensional dependence than SGD in stochastic non-convex optimization. In this paper we introduce new assumptions on the smoothness structure of the objective and on the gradient noise variance.
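For reference, a minimal sketch of the standard diagonal AdaGrad update the entry refers to; the learning rate and epsilon are illustrative defaults.

```python
import numpy as np

def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad update: per-coordinate step sizes shrink
    with the accumulated squared gradients."""
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum
```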
arXiv Detail & Related papers (2024-06-07T02:55:57Z)
- Closed-form Filtering for Non-linear Systems [83.91296397912218]
We propose a new class of filters based on Gaussian PSD Models, which offer several advantages in terms of density approximation and computational efficiency.
We show that filtering can be efficiently performed in closed form when transitions and observations are Gaussian PSD Models.
Our proposed estimator enjoys strong theoretical guarantees, with estimation error that depends on the quality of the approximation and is adaptive to the regularity of the transition probabilities.
arXiv Detail & Related papers (2024-02-15T08:51:49Z)
- Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference [0.0]
This paper presents a groundbreaking approach to causal inference by integrating continuous normalizing flows with parametric submodels.
We leverage optimal transport and Wasserstein gradient flows to develop causal inference methodologies with minimal variance in finite-sample settings.
Preliminary experiments showcase our method's superiority, yielding lower mean-squared errors compared to standard flows.
arXiv Detail & Related papers (2023-11-30T18:59:05Z)
- Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quantile clipping strategy for Stochastic Gradient Descent (SGD).
We use quantiles of the gradient norm as clipping thresholds, making the iterates robust to outliers.
We propose an efficient implementation of the algorithm using rolling quantile estimators.
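A minimal sketch of gradient-norm quantile clipping, assuming a rolling window of recent gradient norms supplies the clipping threshold; the window size and quantile level are illustrative.

```python
import numpy as np
from collections import deque

def clipped_sgd_step(params, grad, norm_history, lr=0.01, q=0.9):
    """SGD step with the gradient clipped at the q-th quantile of
    recently observed gradient norms (robust to heavy-tailed noise)."""
    g_norm = np.linalg.norm(grad)
    norm_history.append(g_norm)
    threshold = np.quantile(norm_history, q)
    if g_norm > threshold:
        grad = grad * (threshold / g_norm)   # rescale down to the threshold
    return params - lr * grad

# usage: keep a rolling window of the last 100 gradient norms
history = deque(maxlen=100)
```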
arXiv Detail & Related papers (2023-09-29T15:24:48Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
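A minimal sketch of clipped gradient descent for streaming mean estimation under heavy-tailed noise; the fixed clipping level and the $1/t$ step sizes are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

def streaming_clipped_mean(samples, clip=5.0):
    """Estimate a p-dimensional mean from a stream using clipped
    gradient steps on the squared loss; clipping tames heavy tails."""
    est = np.zeros_like(samples[0], dtype=float)
    for t, x in enumerate(samples, start=1):
        grad = est - x                    # gradient of 0.5*||est - x||^2
        g_norm = np.linalg.norm(grad)
        if g_norm > clip:
            grad *= clip / g_norm         # clip the gradient norm
        est -= grad / t                   # decaying step size 1/t
    return est
```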
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
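A minimal sketch of AIS with Metropolis-free Langevin transitions along a geometric annealing path, in the spirit of the entry above; the 1-D toy target, schedule, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p0(x):   # base: standard normal (normalized)
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_p1(x):   # unnormalized toy target: N(3, 0.5^2) up to a constant
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def ais_log_weight(n_steps=200, step=0.05):
    """One AIS chain with Metropolis-free Langevin transitions along
    the geometric path log p_beta = (1-beta) log p0 + beta log p1."""
    x = rng.standard_normal()
    log_w = 0.0
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * (log_p1(x) - log_p0(x))   # incremental weight
        grad = (1 - b) * (-x) + b * (-(x - 3.0) / 0.25)   # grad of log p_b
        x += 0.5 * step * grad + np.sqrt(step) * rng.standard_normal()
    return log_w

# average importance weights over chains to estimate log Z
log_ws = np.array([ais_log_weight() for _ in range(100)])
log_z = np.logaddexp.reduce(log_ws) - np.log(len(log_ws))
```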
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
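For reference, a minimal sketch of the standard SGLD update: a stochastic-gradient step on $\log p$ plus Gaussian noise scaled to the step size. In the usage line an exact gradient stands in for the mini-batch estimate used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(x, stoch_grad_log_p, step=1e-3):
    """One SGLD update: a stochastic-gradient ascent step on log p
    plus Gaussian noise with variance 2*step."""
    noise = np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x + step * stoch_grad_log_p(x) + noise

# usage: sample from N(0, I), where grad log p(x) = -x
x = np.zeros(3)
for _ in range(1000):
    x = sgld_step(x, lambda v: -v)
```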
arXiv Detail & Related papers (2020-10-19T15:23:18Z)
- Optimal Variance Control of the Score Function Gradient Estimator for Importance Weighted Bounds [12.75471887147565]
This paper introduces novel results for the score function gradient estimator of the importance weighted variational bound (IWAE).
We prove that in the limit of large $K$ one can choose the control variate such that the Signal-to-Noise ratio (SNR) of the estimator grows as $\sqrt{K}$.
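A minimal sketch of a score-function (REINFORCE) gradient of the IWAE bound for a unit-variance Gaussian proposal, with a scalar baseline as a simple control variate; the baseline here is illustrative, not the optimal control variate the paper constructs. `log_p` is assumed to return unnormalized log-densities for a batch of samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_function_grad(mu, log_p, K=64, baseline=0.0):
    """Score-function gradient of the IWAE bound w.r.t. the mean of a
    unit-variance Gaussian proposal q = N(mu, I), with a scalar
    baseline subtracted as a simple (unbiased) control variate."""
    d = mu.size
    z = mu + rng.standard_normal((K, d))            # z_k ~ q
    log_q = -0.5 * np.sum((z - mu) ** 2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    log_w = log_p(z) - log_q                        # log importance weights
    log_sum = np.logaddexp.reduce(log_w)
    l_hat = log_sum - np.log(K)                     # IWAE bound estimate
    w_tilde = np.exp(log_w - log_sum)               # normalized weights
    score = z - mu                                  # grad_mu log q(z_k)
    # unbiased estimator: sum_k (L_hat - baseline - w_tilde_k) * score_k
    coeff = (l_hat - baseline - w_tilde)[:, None]
    return np.sum(coeff * score, axis=0), l_hat
```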
arXiv Detail & Related papers (2020-08-05T08:41:46Z)
- On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification [101.0377583883137]
Projection robust (PR) OT seeks to maximize the OT cost between two measures by choosing a $k$-dimensional subspace onto which they can be projected.
Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances.
Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, by averaging rather than optimizing on subspaces.
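A minimal sketch in the spirit of the IPRW distance for the special case $k=1$: average squared 1-D Wasserstein distances over random unit directions and take the square root. Monte Carlo sampling of directions and the equal-sample-size requirement are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def iprw_1d_estimate(X, Y, n_proj=50):
    """Monte Carlo sketch of an integral projection-robust Wasserstein
    distance for k=1: average squared 1-D W2 distances over random
    directions (X and Y must have equal sample counts)."""
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                 # random unit direction
        px, py = np.sort(X @ u), np.sort(Y @ u)
        total += np.mean((px - py) ** 2)       # squared W2 via sorting
    return np.sqrt(total / n_proj)
```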
arXiv Detail & Related papers (2020-06-22T14:35:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.