Importance Weighted Variational Inference without the Reparameterization Trick
- URL: http://arxiv.org/abs/2602.01412v1
- Date: Sun, 01 Feb 2026 19:39:30 GMT
- Title: Importance Weighted Variational Inference without the Reparameterization Trick
- Authors: Kamélia Daudel, Minh-Ngoc Tran, Cheng Zhang,
- Abstract summary: We show that state-of-the-art VIMCO gradient estimators exhibit a vanishing signal-to-noise ratio (SNR) as $N$ increases. We propose the novel VIMCO-$\star$ gradient estimator and show that it averts the SNR collapse of existing VIMCO gradient estimators.
- Score: 7.6837301319181535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Importance weighted variational inference (VI) approximates densities known up to a normalizing constant by optimizing bounds that tighten with the number of Monte Carlo samples $N$. Standard optimization relies on reparameterized gradient estimators, which are well-studied theoretically yet restrict both the choice of the data-generating process and the variational approximation. While REINFORCE gradient estimators do not suffer from such restrictions, they lack rigorous theoretical justification. In this paper, we provide the first comprehensive analysis of REINFORCE gradient estimators in importance weighted VI, leveraging this theoretical foundation to diagnose and resolve fundamental deficiencies in current state-of-the-art estimators. Specifically, we introduce and examine a generalized family of variational inference for Monte Carlo objectives (VIMCO) gradient estimators. We prove that state-of-the-art VIMCO gradient estimators exhibit a vanishing signal-to-noise ratio (SNR) as $N$ increases, which prevents effective optimization. To overcome this issue, we propose the novel VIMCO-$\star$ gradient estimator and show that it averts the SNR collapse of existing VIMCO gradient estimators by achieving a $\sqrt{N}$ SNR scaling instead. We demonstrate its superior empirical performance compared to current VIMCO implementations in challenging settings where reparameterized gradients are typically unavailable.
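To make the REINFORCE setting concrete, here is a minimal NumPy sketch of the classic VIMCO estimator (in the style of Mnih and Rezende) for an importance weighted bound over a toy one-dimensional target with a Gaussian variational family. All function and variable names are illustrative, and this is the baseline family of estimators the paper analyzes, not the proposed VIMCO-$\star$ estimator.

```python
# A minimal sketch, assuming a 1-D unnormalized target and a Gaussian
# variational family. This is the classic VIMCO estimator, NOT the paper's
# VIMCO-* estimator: a score-function (REINFORCE) gradient of the importance
# weighted bound, formed without reparameterization, using leave-one-out baselines.
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_p(z):
    # unnormalized log target density (here: N(2, 1) up to an additive constant)
    return -0.5 * (z - 2.0) ** 2

def log_q(z, mu, log_sigma):
    # log density of the variational Gaussian q(z) = N(mu, sigma^2)
    return (-0.5 * ((z - mu) ** 2) * np.exp(-2.0 * log_sigma)
            - log_sigma - 0.5 * np.log(2 * np.pi))

def score_q(z, mu, log_sigma):
    # gradient of log q(z | mu, log_sigma) with respect to (mu, log_sigma)
    inv_var = np.exp(-2.0 * log_sigma)
    return np.array([(z - mu) * inv_var, ((z - mu) ** 2) * inv_var - 1.0])

def vimco_gradient(mu, log_sigma, N=16):
    z = mu + np.exp(log_sigma) * rng.standard_normal(N)   # N i.i.d. samples from q
    log_w = log_p(z) - log_q(z, mu, log_sigma)            # log importance weights
    L_hat = logsumexp(log_w) - np.log(N)                  # importance weighted bound estimate
    w_tilde = np.exp(log_w - logsumexp(log_w))            # self-normalized weights

    grad = np.zeros(2)
    for i in range(N):
        # leave-one-out baseline: replace log_w[i] by the mean of the other log-weights
        log_w_loo = log_w.copy()
        log_w_loo[i] = np.mean(np.delete(log_w, i))
        baseline = logsumexp(log_w_loo) - np.log(N)
        s_i = score_q(z[i], mu, log_sigma)
        grad += (L_hat - baseline) * s_i      # score-function term, per-sample learning signal
        grad += w_tilde[i] * (-s_i)           # direct dependence of log_w on the variational parameters
    return L_hat, grad

bound, g = vimco_gradient(mu=0.0, log_sigma=0.0)
print(bound, g)
```

The paper's SNR analysis concerns how the quality of exactly this kind of score-function gradient behaves as $N$ grows; the VIMCO-$\star$ correction itself is not reproduced here.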
Related papers
- Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions [0.0]
In high-dimensional settings, unbiased estimators are generally inadmissible under quadratic loss. We construct a shrinkage estimator that adaptively contracts noisy mini-batch gradients toward a stable restricted estimator. We show that this estimator uniformly dominates the standard gradient estimator under quadratic loss and is minimax-optimal in the classical decision-theoretic sense.
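As a rough illustration of the idea described above, the sketch below applies a positive-part James-Stein shrinkage to a noisy mini-batch gradient, pulling it toward a restricted estimator such as a running average. The isotropic-noise assumption, the known noise variance, and all names are assumptions for the sketch, not the paper's actual construction.

```python
import numpy as np

def stein_shrunk_gradient(g_minibatch, g_restricted, noise_var):
    """Positive-part James-Stein shrinkage of a noisy mini-batch gradient
    toward a stable restricted estimator (isotropic noise with known
    per-coordinate variance `noise_var` is assumed)."""
    d = g_minibatch.size
    diff = g_minibatch - g_restricted
    # shrink harder when the observed deviation is small relative to the noise level
    factor = max(0.0, 1.0 - (d - 2) * noise_var / float(diff @ diff))
    return g_restricted + factor * diff

# toy usage: shrink a noisy 100-dimensional gradient toward a running average
rng = np.random.default_rng(1)
g_true = np.ones(100)
g_noisy = g_true + rng.normal(scale=1.0, size=100)
g_running_avg = 0.9 * g_true                      # stand-in for a restricted estimator
print(stein_shrunk_gradient(g_noisy, g_running_avg, noise_var=1.0))
```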
arXiv Detail & Related papers (2026-02-02T08:01:13Z) - Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning [8.349781300731225]
We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions. Our approach addresses these challenges by: (i) adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance.
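The RLOO ingredient in (iii) has a compact generic form. The sketch below (plain NumPy; the function name and array shapes are assumptions, and this is not the LOREN method itself) centers each sample's objective by the mean of the other samples before weighting its score.

```python
import numpy as np

def rloo_gradient(objectives, score_grads):
    """REINFORCE leave-one-out (RLOO) gradient estimate.

    objectives:  shape (K,)    objective value for each sampled perturbation
    score_grads: shape (K, d)  gradient of each sample's log-probability w.r.t. the parameters
    """
    K = objectives.shape[0]
    loo_baselines = (objectives.sum() - objectives) / (K - 1)   # mean of the other K-1 values
    learning_signal = objectives - loo_baselines                # variance-reduced signal
    return (learning_signal[:, None] * score_grads).mean(axis=0)

# toy usage: K = 8 samples of a 3-dimensional parameter
rng = np.random.default_rng(0)
print(rloo_gradient(rng.normal(size=8), rng.normal(size=(8, 3))))
```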
arXiv Detail & Related papers (2025-11-11T08:34:09Z) - On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization [57.179679246370114]
A potential limitation of existing methods is the bias inherent in most perturbation-based gradient estimators unless the perturbation stepsize is driven to zero. We propose a novel family of unbiased gradient estimators that eliminate this bias while maintaining a favorable construction.
arXiv Detail & Related papers (2025-10-22T18:25:43Z) - From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models [90.45197506653341]
Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers. Aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
arXiv Detail & Related papers (2025-10-06T17:58:01Z) - Quantum Subgradient Estimation for Conditional Value-at-Risk Optimization [0.0]
Conditional Value-at-Risk (CVaR) is a leading tail-risk measure in finance. We analyze a quantum subgradient oracle for CVaR minimization based on amplitude estimation.
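For orientation, the quantity such a subgradient oracle targets is usually written through the Rockafellar-Uryasev representation of CVaR; the display below is the standard classical form (with $t^\star$ the $\alpha$-level VaR), not the paper's amplitude-estimation construction.

```latex
\mathrm{CVaR}_\alpha\bigl(L(x,\xi)\bigr)
  = \min_{t\in\mathbb{R}}\Bigl\{\, t + \tfrac{1}{1-\alpha}\,\mathbb{E}\bigl[(L(x,\xi)-t)_+\bigr] \Bigr\},
\qquad
g(x) = \tfrac{1}{1-\alpha}\,\mathbb{E}\bigl[\mathbf{1}\{L(x,\xi)\ge t^\star\}\,\nabla_x L(x,\xi)\bigr]
  \;\in\; \partial_x\,\mathrm{CVaR}_\alpha\bigl(L(x,\xi)\bigr).
```

Under standard regularity conditions $g(x)$ is a valid subgradient, and approximating the expectation (by Monte Carlo classically, or by amplitude estimation in the paper's setting) yields the stochastic subgradient used for optimization.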
arXiv Detail & Related papers (2025-10-06T12:09:43Z) - Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
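A minimal sketch, under illustrative names and a simple combined update rule, of what a normalized-and-clipped SGD step can look like; the paper's precise variant and its accelerated version are not reproduced here.

```python
import numpy as np

def normalized_clipped_sgd_step(params, grad, lr=1e-2, clip=1.0, eps=1e-12):
    """One SGD step whose direction is the normalized gradient and whose
    magnitude is the gradient norm clipped at `clip` -- one common way to
    tame heavy-tailed gradient noise."""
    g_norm = np.linalg.norm(grad)
    direction = grad / (g_norm + eps)          # gradient normalization
    magnitude = min(g_norm, clip)              # gradient clipping
    return params - lr * magnitude * direction

# toy usage: the step direction is [0.6, 0.8, 0.0] with clipped magnitude 1.0
print(normalized_clipped_sgd_step(np.zeros(3), np.array([3.0, 4.0, 0.0])))
```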
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for BCE (the bias-constrained estimator) is in applications where multiple estimates of the same unknown are averaged for improved performance.
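A loose sketch of one way a bias constraint can enter a deep-learning loss: the data are assumed to come in groups that share the same ground-truth parameter, and the empirical within-group bias is penalized alongside the MSE. The grouping assumption, the penalty form, and all names here are illustrative and may differ from the paper's exact BCE objective.

```python
import numpy as np

def bias_constrained_loss(estimates, truths, lam=1.0):
    """MSE plus a squared-bias penalty (illustrative, not the paper's exact loss).

    estimates, truths: shape (G, M, d) -- G groups, each with M samples that
    share the same ground-truth parameter; the empirical bias is computed
    within each group and penalized.
    """
    err = estimates - truths
    mse = np.mean(err ** 2)
    group_bias = err.mean(axis=1)                        # (G, d): empirical bias per group
    bias_penalty = np.mean(np.sum(group_bias ** 2, axis=-1))
    return mse + lam * bias_penalty

# toy usage: 4 groups of 8 noisy estimates of a 2-D parameter
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 1, 2))
est = theta + rng.normal(scale=0.5, size=(4, 8, 2))
print(bias_constrained_loss(est, np.broadcast_to(theta, (4, 8, 2))))
```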
arXiv Detail & Related papers (2021-10-24T10:23:51Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent algorithm and provide an improved analysis under a more nuanced condition on the gradient noise.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Unbiased Gradient Estimation for Distributionally Robust Learning [2.1777837784979277]
We consider a new approach based on distributionally robust learning (DRL) that applies gradient descent to the inner problem.
Our algorithm efficiently estimates the gradient through multi-level Monte Carlo randomization.
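The randomization idea can be illustrated generically with a Rhee-Glynn style single-term estimator: draw a random level, form the telescoping correction at that level, and reweight by its probability. The geometric level distribution, the `estimator_at_level` interface, and the toy example are assumptions for the sketch, not the paper's exact multi-level scheme.

```python
import numpy as np

def single_term_mlmc(estimator_at_level, p=0.5, rng=None):
    """Unbiased (in expectation) estimate of lim_{l -> inf} E[estimator_at_level(l)]
    via a randomly truncated telescoping sum."""
    rng = rng or np.random.default_rng()
    level = int(rng.geometric(p))                 # P(level = k) = p * (1 - p)**(k - 1), k >= 1
    prob = p * (1.0 - p) ** (level - 1)
    correction = estimator_at_level(level) - estimator_at_level(level - 1)
    return estimator_at_level(0) + correction / prob

# toy usage: a level-l approximation 1 - 2**(-l) of the limit 1.0; because the
# geometric tail matches the 2**(-l) bias decay, every draw here returns exactly 1.0
approx = lambda l: 1.0 - 2.0 ** (-l)
print(single_term_mlmc(approx))
```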
arXiv Detail & Related papers (2020-12-22T21:35:03Z) - On Signal-to-Noise Ratio Issues in Variational Inference for Deep Gaussian Processes [55.62520135103578]
We show that the gradient estimates used in training Deep Gaussian Processes (DGPs) with importance-weighted variational inference are susceptible to signal-to-noise ratio (SNR) issues.
We show that our fix can lead to consistent improvements in the predictive performance of DGP models.
arXiv Detail & Related papers (2020-11-01T14:38:02Z) - Optimal Variance Control of the Score Function Gradient Estimator for Importance Weighted Bounds [12.75471887147565]
This paper introduces novel results for the score function gradient estimator of the importance weighted variational bound (IWAE).
We prove that in the limit of large $K$ one can choose the control variate such that the signal-to-noise ratio (SNR) of the estimator grows as $\sqrt{K}$.
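Both that result and the SNR analysis in the present paper use the same notion of signal-to-noise ratio for a (coordinate of a) gradient estimator $\hat g$ built from $K$ (or $N$) samples; written out, it is

```latex
\mathrm{SNR}(\hat g) \;=\; \frac{\bigl|\mathbb{E}[\hat g]\bigr|}{\sqrt{\operatorname{Var}[\hat g]}},
```

so an SNR that grows like $\sqrt{K}$ means the estimator's mean increasingly dominates its standard deviation as more samples are used, while a vanishing SNR means the gradient signal is drowned out by noise.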
arXiv Detail & Related papers (2020-08-05T08:41:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.