Related papers: Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence

URL: http://arxiv.org/abs/2504.19259v1
Date: Sun, 27 Apr 2025 14:39:33 GMT
Title: Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence
Authors: Adwait Datar, Nihat Ay,
Abstract summary: We study the problem of minimizing the Kullback-Leibler (KL) divergence. We analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family ($\theta$ coordinates) and the mixture family ($\eta$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the parameter space. In continuous time, we prove that the convergence rates of GD in the $\theta$ and $\eta$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $\eta$ and $\theta$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
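
As a rough illustration of the discrete-time comparison described in the abstract, the following sketch runs Euclidean GD and NGD in the $\theta$ coordinates of a categorical distribution. It is not the authors' code. The objective direction $\mathrm{KL}(q \,\|\, p_\theta)$, the reference-outcome parameterization, the step size, the step count, and the names `probs`, `kl`, and `run` are all assumptions made for this example; GD in the $\eta$ coordinates is left out.

```python
import numpy as np

def probs(theta):
    """Map theta (length n) to a distribution on n + 1 outcomes.

    Outcome 0 is the reference outcome with its logit fixed at 0
    (an assumed parameterization, chosen for this sketch).
    """
    logits = np.concatenate(([0.0], theta))
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def kl(q, p):
    """KL(q || p) for strictly positive q and p."""
    return float(np.sum(q * np.log(q / p)))

def run(q, steps=50, lr=0.5, natural=False):
    """Minimize KL(q || p_theta) over theta, starting from the uniform distribution.

    natural=False: Euclidean gradient descent in theta coordinates.
    natural=True:  the gradient is preconditioned by the inverse Fisher matrix.
    Returns the final KL value.
    """
    theta = np.zeros(len(q) - 1)
    for _ in range(steps):
        eta = probs(theta)[1:]      # mixture (eta) coordinates of p_theta
        grad = eta - q[1:]          # Euclidean gradient w.r.t. theta
        if natural:
            fisher = np.diag(eta) - np.outer(eta, eta)  # Fisher matrix in theta
            grad = np.linalg.solve(fisher, grad)
        theta = theta - lr * grad
    return kl(q, probs(theta))

if __name__ == "__main__":
    q = np.array([0.7, 0.2, 0.1])  # hypothetical target distribution
    print("GD  final KL:", run(q, natural=False))
    print("NGD final KL:", run(q, natural=True))
```

Under these assumptions the Fisher matrix in $\theta$ coordinates is $\mathrm{diag}(\eta) - \eta\eta^\top$, so near the optimum an NGD step shrinks the error in $\theta$ by a factor of roughly $1 - \text{lr}$ regardless of the target, while the GD contraction depends on the eigenvalues of the Fisher matrix at the optimum.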

Related papers

Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives [6.740173664466834]
The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We investigate the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings.
arXiv Detail & Related papers (2025-01-28T18:13:41Z)
Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions. Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z)
Convergence of coordinate ascent variational inference for log-concave measures via optimal transport [0.0]
Mean field variational inference (VI) is the problem of finding the closest product (factorized) measure. The well-known coordinate ascent variational inference (CAVI) algorithm aims to approximate this measure by optimizing over one coordinate at a time.
arXiv Detail & Related papers (2024-04-12T19:43:54Z)
Nesterov Meets Optimism: Rate-Optimal Separable Minimax Optimization [108.35402316802765]
We propose a new first-order optimization algorithm, AcceleratedGradient-OptimisticGradient (AG-OG) Ascent. We show that AG-OG achieves the optimal convergence rate (up to a constant) for a variety of settings. We also extend our algorithm and achieve the optimal convergence rate in both bi-SC-SC and bi-C-SC settings.
arXiv Detail & Related papers (2022-10-31T17:59:29Z)
NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
Robust Training in High Dimensions via Block Coordinate Geometric Median Descent [69.47594803719333]
Geometric median (GM) is a classical method in statistics for achieving a robust estimation of the uncorrupted data. In this paper, we show that by applying GM to only a judiciously chosen block of coordinates at a time, one can retain a breakdown point of 0.5 for smooth non-convex problems.
arXiv Detail & Related papers (2021-06-16T15:55:50Z)
Escaping Saddle Points with Stochastically Controlled Stochastic Gradient Methods [12.173568611144626]
We show that a first-order saddle gradient descent (SCSG) method can be perturbed by noise or a step. A separate step is proposed to help solve this problem. The proposed step is designed to incorporate the proposed CNC-SCSGD method further for saddle points.
arXiv Detail & Related papers (2021-03-07T18:09:43Z)
A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of the stochastic variance reduced gradient (SVRG) method. A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG. The method needs $\mathcal{O}(\min\{1/\epsilon^{3/2}, n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity.
arXiv Detail & Related papers (2021-02-19T12:22:56Z)
Proximal Gradient Descent-Ascent: Variable Convergence under KŁ Geometry [49.65455534654459]
The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems. This paper fills such a gap by studying its convergence under the Kurdyka-Łojasiewicz (KŁ) geometry.
arXiv Detail & Related papers (2021-02-09T05:35:53Z)
Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave. At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z)
Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate. We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD). We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem. We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent. Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.