Semi-Implicit Variational Inference via Kernelized Path Gradient Descent
- URL: http://arxiv.org/abs/2506.05088v1
- Date: Thu, 05 Jun 2025 14:34:37 GMT
- Title: Semi-Implicit Variational Inference via Kernelized Path Gradient Descent
- Authors: Tobias Pielok, Bernd Bischl, David Rügamer
- Abstract summary: Training with the Kullback-Leibler divergence can be challenging due to high variance and bias in high-dimensional settings. We propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing. Our method's bias in function space is benign, leading to more stable and efficient optimization.
- Score: 12.300415631357406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semi-implicit variational inference (SIVI) is a powerful framework for approximating complex posterior distributions, but training with the Kullback-Leibler (KL) divergence can be challenging due to high variance and bias in high-dimensional settings. While current state-of-the-art semi-implicit variational inference methods, particularly Kernel Semi-Implicit Variational Inference (KSIVI), have been shown to work in high dimensions, training remains moderately expensive. In this work, we propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing. To further reduce the bias, we introduce an importance sampling correction. We provide a theoretical connection to the amortized version of the Stein variational gradient descent, which estimates the score gradient via Stein's identity, showing that both methods minimize the same objective, but our semi-implicit approach achieves lower gradient variance. In addition, our method's bias in function space is benign, leading to more stable and efficient optimization. Empirical results demonstrate that our method outperforms or matches state-of-the-art SIVI methods in both performance and training efficiency.
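To make the "nonparametric smoothing" idea concrete, the sketch below estimates the score grad log q(x) from samples of q via a Gaussian kernel density estimate. This is an illustrative assumption, not the paper's actual kernelized KL estimator; the function name and bandwidth h are ours.

```python
import numpy as np

def kde_score(x, samples, h=0.5):
    """Nonparametric (kernel-smoothed) estimate of grad log q(x), using a
    Gaussian KDE fit to samples drawn from q. Illustrative sketch only: the
    paper's kernelized KL estimator and importance correction are omitted."""
    diffs = samples - x                               # (n, d)
    logk = -np.sum(diffs**2, axis=1) / (2 * h**2)     # log Gaussian kernel
    w = np.exp(logk - logk.max())
    w /= w.sum()
    # grad_x log q_hat(x) = sum_i w_i (z_i - x) / h^2
    return (w[:, None] * diffs).sum(axis=0) / h**2
```

For samples from N(0, 1), the smoothed score at x approximates -x / (1 + h^2), i.e. the score of the kernel-widened density, which is the benign function-space bias the abstract alludes to.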
Related papers
- Kernel Semi-Implicit Variational Inference [27.61976547543748]
Semi-implicit variational inference (SIVI) extends traditional variational families with semi-implicit distributions defined in a hierarchical manner.
A recent advancement in SIVI, named SIVI-SM, utilizes an alternative score matching objective made tractable via a minimax formulation.
We propose kernel SIVI (KSIVI), a variant of SIVI-SM that eliminates the need for lower-level optimization through kernel tricks.
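A kernel can eliminate the inner optimization because the Stein direction has a closed form in an RKHS. The sketch below shows one Stein variational gradient descent (SVGD) step with an RBF kernel, which plays an analogous role; it is an illustration of the kernel trick, not KSIVI itself.

```python
import numpy as np

def svgd_step(particles, score, lr=0.1, h=1.0):
    """One SVGD update: the kernelized Stein direction is computable in
    closed form, so no lower-level optimization is needed (sketch)."""
    d2 = np.sum((particles[:, None, :] - particles[None, :, :])**2, axis=-1)
    k = np.exp(-d2 / (2 * h**2))                      # (n, n) RBF Gram matrix
    # grad_{x_j} k(x_j, x_i) = -(x_j - x_i) k / h^2  (repulsion term)
    gk = (particles[:, None, :] - particles[None, :, :]) * (-k[..., None] / h**2)
    phi = (k @ score(particles) + gk.sum(axis=0)) / len(particles)
    return particles + lr * phi
```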
arXiv Detail & Related papers (2024-05-29T11:21:25Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1-\frac{1}{\alpha}})$.
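For reference, one plain AdaGrad update is sketched below; the over-the-air aggregation and channel noise that the paper analyses are omitted, so this only shows the per-coordinate step-size shrinkage that drives the stated rate.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-coordinate step sizes shrink with the
    accumulated squared gradients (sketch; federated/over-the-air parts
    of the paper are not modelled here)."""
    accum = accum + grad**2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum
```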
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models
via Reparameterisation and Smoothing [1.6114012813668932]
We introduce a simple framework for defining non-differentiable functions piecewise and present a systematic approach to obtaining smoothings.
Our main contribution is a novel variant of SGD, Diagonalisation Gradient Descent, which progressively enhances the accuracy of the smoothed approximation.
Our approach is simple, fast, and stable, and attains orders-of-magnitude reductions in work-normalised variance.
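A minimal example of such a smoothing: replacing the Heaviside step [x >= 0] with a sigmoid whose accuracy coefficient is raised over iterations, mirroring the progressive schedule in Diagonalisation Gradient Descent (an illustrative sketch; the paper's construction is more general).

```python
import numpy as np

def smoothed_step(x, acc):
    """Sigmoid smoothing of the Heaviside step [x >= 0]; increasing the
    accuracy coefficient `acc` sharpens the approximation while keeping
    it differentiable everywhere."""
    return 1.0 / (1.0 + np.exp(-acc * x))
```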
arXiv Detail & Related papers (2024-02-19T00:43:22Z) - Byzantine-Robust Decentralized Stochastic Optimization with Stochastic
Gradient Noise-Independent Learning Error [25.15075119957447]
We study Byzantine-robust optimization over a decentralized network, where every agent periodically communicates with its neighbors to exchange local models, and then updates its own local model by stochastic gradient descent (SGD).
The performance of such a method is affected by an unknown number of Byzantine agents, which behave adversarially during the optimization process.
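A standard defence in this setting is a robust aggregation rule; the coordinate-wise median below is one common choice (illustrative, and not necessarily the rule used in the paper).

```python
import numpy as np

def robust_aggregate(models):
    """Coordinate-wise median of neighbours' models: a single Byzantine
    model with extreme values cannot drag the aggregate far, unlike a
    plain average (sketch of one standard robust rule)."""
    return np.median(np.stack(models), axis=0)
```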
arXiv Detail & Related papers (2023-08-10T02:14:23Z) - Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima.
We study the commonly used averaged SGD algorithm, whose benefits were empirically observed by Izmailov et al.
We prove that averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima.
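The scheme being analysed is just SGD with a running mean of the iterates; a minimal sketch on a noisy quadratic (our toy objective, not the paper's setting) is below.

```python
import numpy as np

def averaged_sgd(grad, w0, lr=0.05, steps=500, seed=0):
    """SGD plus iterate averaging: the running mean of the iterates
    behaves like a minimizer of an implicitly smoothed objective
    (sketch; `grad` returns a noisy gradient)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)
    for t in range(1, steps + 1):
        w = w - lr * grad(w, rng)
        avg += (w - avg) / t          # running average of the iterates
    return w, avg
```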
arXiv Detail & Related papers (2023-02-18T16:29:06Z) - Learning Discretized Neural Networks under Ricci Flow [48.47315844022283]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations. DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
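The sketch below illustrates the idea on a 1-D toy problem: AIS from N(0, 1) to an unnormalized shifted Gaussian, with unadjusted Langevin transitions in place of Metropolis-Hastings so every operation is smooth in the parameters. The target, schedule, and step size are our illustrative choices, not the paper's.

```python
import numpy as np

def ais_Z(n=4000, K=100, step=0.1, seed=0):
    """AIS estimate of the normalizer of exp(-(x-1)^2/2) (true value
    sqrt(2*pi)), annealing from N(0,1) with unadjusted Langevin moves
    instead of accept/reject steps (differentiable-AIS sketch)."""
    rng = np.random.default_rng(seed)
    log_f0 = lambda x: -0.5 * x**2            # base, unnormalized
    log_fT = lambda x: -0.5 * (x - 1.0)**2    # target, unnormalized
    x = rng.standard_normal(n)
    logw = np.zeros(n)
    betas = np.linspace(0.0, 1.0, K + 1)
    for b0, b1 in zip(betas[:-1], betas[1:]):
        logw += (b1 - b0) * (log_fT(x) - log_f0(x))
        # one unadjusted Langevin step on the intermediate density:
        # grad log f_b(x) = -x + b  for this geometric path
        x = x + 0.5 * step * (-x + b1) + np.sqrt(step) * rng.standard_normal(n)
    # mean weight estimates Z_T / Z_0, and Z_0 = sqrt(2*pi)
    m = logw.max()
    return np.sqrt(2 * np.pi) * np.exp(m) * np.mean(np.exp(logw - m))
```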
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, together with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
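A deterministic sketch of the scheme on the bilinear game min_x max_y x*y: plain simultaneous gradient descent diverges here, while extragradient with iterate averaging approaches the Nash equilibrium (0, 0). The stochastic noise and restarting schedule from the paper are omitted.

```python
def seg_averaged(x0, y0, lr=0.1, steps=2000):
    """ExtraGradient with iterate averaging on min_x max_y x*y
    (noise-free sketch of the averaged scheme the paper analyses)."""
    x, y = x0, y0
    ax = ay = 0.0
    for t in range(1, steps + 1):
        # extrapolation (lookahead) step
        xh, yh = x - lr * y, y + lr * x
        # update step uses the gradient at the lookahead point
        x, y = x - lr * yh, y + lr * xh
        ax += (x - ax) / t
        ay += (y - ay) / t
    return (x, y), (ax, ay)
```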
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Training Generative Adversarial Networks by Solving Ordinary
Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training.
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error.
We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
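The effect is easy to reproduce on the rotational vector field that models simple (Dirac-style) GAN dynamics: Euler integration spirals away from the equilibrium, while a classic Runge-Kutta 4 step tracks it closely. The toy field below is our simplification, not the paper's experiments.

```python
import numpy as np

def rk4_step(f, z, h):
    """Classic fourth-order Runge-Kutta step for dz/dt = f(z)."""
    k1 = f(z)
    k2 = f(z + 0.5 * h * k1)
    k3 = f(z + 0.5 * h * k2)
    k4 = f(z + h * k3)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def gan_field(z):
    """Simultaneous-gradient field of the bilinear (Dirac) GAN toy game:
    pure rotation around the equilibrium at the origin."""
    x, y = z
    return np.array([-y, x])
```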
arXiv Detail & Related papers (2020-10-28T15:23:49Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
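The coupling idea can be sketched on a toy bilinear objective f(x, y) = (x*y - 1)^2, where each variable's update uses the current value of the other; CoGD's sparsity-constrained machinery is omitted, so this only shows the simultaneous coupled descent.

```python
def coupled_step(x, y, lr=0.05):
    """Simultaneous gradient step on f(x, y) = (x*y - 1)^2, where each
    variable's gradient depends on the other through the bilinear term
    (toy sketch, not the full CoGD algorithm)."""
    r = x * y - 1.0
    return x - lr * 2 * r * y, y - lr * 2 * r * x
```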
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
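The analytic half of such a hybrid estimator can be illustrated on a single stochastic binary unit: marginalizing the unit analytically gives an exact expectation and gradient, with sampling needed only for the rest of the network. This sketch shows only that one-unit analytic step, not the paper's full estimator.

```python
import numpy as np

def stoch_binary_expect_grad(z):
    """For a stochastic binary unit b ~ Bernoulli(sigmoid(z)), return the
    exact expectation E[b] and its analytic derivative dE[b]/dz
    (the analytic building block of a sample-analytic hybrid)."""
    p = 1.0 / (1.0 + np.exp(-z))   # P(b = 1) = E[b]
    return p, p * (1.0 - p)        # sigmoid and its derivative
```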
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.