Improved Stochastic Optimization of LogSumExp
- URL: http://arxiv.org/abs/2509.24894v1
- Date: Mon, 29 Sep 2025 15:03:55 GMT
- Title: Improved Stochastic Optimization of LogSumExp
- Authors: Egor Gladin, Alexey Kroshnin, Jia-Jie Zhu, Pavel Dvurechensky
- Abstract summary: We propose a novel approximation to LogSumExp that can be efficiently optimized using gradient methods. The accuracy of the approximation is controlled by a tunable parameter and can be made arbitrarily small. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach.
- Score: 2.8547553943343797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The LogSumExp function, also known as the free energy, plays a central role in many important optimization problems, including entropy-regularized optimal transport and distributionally robust optimization (DRO). It is also the dual to the Kullback-Leibler (KL) divergence, which is widely used in machine learning. In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. Previous approaches that replace the full sum with a small batch introduce significant bias. We propose a novel approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the safe KL divergence. The accuracy of the approximation is controlled by a tunable parameter and can be made arbitrarily small. Like LogSumExp, our approximation preserves convexity. Moreover, when applied to an $L$-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with $L$. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach over state-of-the-art baselines and the effective treatment of numerical issues associated with the standard LogSumExp and KL.
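The mini-batch bias mentioned in the abstract can be checked directly: by Jensen's inequality, averaging per-batch LogSumExp values always underestimates the LogSumExp of the full sum. A minimal NumPy sketch (illustrative only; this is the naive biased estimator, not the paper's safe-KL approximation):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): shift by the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# Log of the mean of exponentials over the full sample.
full = logsumexp(x) - np.log(x.size)

# Naive mini-batch estimate: average the LogSumExp of each batch.
# Since log is concave, mean(log Z_batch) < log(mean Z_batch) = full,
# so this estimator is biased downward.
batch_vals = [logsumexp(b) - np.log(b.size) for b in np.split(x, 100)]
biased = np.mean(batch_vals)

print(full, biased)  # biased is strictly below full
```

The downward bias here is deterministic (it follows from concavity of the logarithm on any split of the data), which is the motivation for replacing the naive batch estimator with a better-behaved approximation.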
Related papers
- $γ$-weakly $θ$-up-concavity: Linearizable Non-Convex Optimization with Applications to DR-Submodular and OSS Functions [52.031993908548294]
We introduce $γ$-weakly $θ$-up-concavity, a novel first-order condition that characterizes a broad class of such functions. Our framework recovers the optimal approximation coefficient for the DR-submodular class and significantly improves existing approximation coefficients.
arXiv Detail & Related papers (2026-02-13T22:34:44Z) - Variational Entropic Optimal Transport [67.76725267984578]
We propose Variational Entropic Optimal Transport (VarEOT) for domain translation problems. VarEOT is based on an exact variational reformulation of the log-partition function $\log \mathbb{E}[\exp(\cdot)]$ as a tractable minimization over an auxiliary positive normalizer. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality.
arXiv Detail & Related papers (2026-02-02T15:48:44Z) - Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent [1.7188280334580195]
We revisit Toeplitz covariance estimation through the lens of simple gradient descent (GD). We show that when $K = P$, GD may converge to suboptimal solutions. We propose an accelerated GD variant with separate learning rates for amplitudes and frequencies.
arXiv Detail & Related papers (2025-11-03T14:07:53Z) - More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems [2.5971517743176915]
We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), which integrates the Two-Scale Effective Dimension (2SED) with fractional-order SGD (FOSGD). By tracking sensitivity and effective dimensionality, 2SEDFOSGD dynamically modulates the fractional exponent to mitigate sluggish oscillations and hasten convergence.
arXiv Detail & Related papers (2025-05-05T19:27:36Z) - Gradient Descent Methods for Regularized Optimization [0.6624754673303327]
The gradient descent (GD) method is one of the primary methods used for numerical optimization of differentiable objective functions. A more effective variant of GD, called proximal gradient descent, employs a technique known as soft-thresholding to shrink the iteration updates toward zero. This paper proposes a novel algorithm for the proximal GD method that incorporates a variable step size.
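Soft-thresholding is the proximal operator of the $\ell_1$ norm; applied after each gradient step it yields sparse iterates. A minimal sketch of the standard proximal gradient step with a fixed step size (the paper's variable step-size rule is not reproduced here):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gd(A, b, lam, step, iters=500):
    """Proximal gradient descent for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                  # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]                     # sparse ground truth
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1/L, L = smoothness constant
x_hat = proximal_gd(A, b, lam=0.1, step=step)
print(np.round(x_hat[:5], 2))
```

The shrinkage amount `step * lam` is exactly what makes small coordinates collapse to zero, which plain GD on the same objective cannot do.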
arXiv Detail & Related papers (2024-12-28T10:54:15Z) - Moreau Envelope ADMM for Decentralized Weakly Convex Optimization [55.2289666758254]
This paper proposes a proximal variant of the alternating direction method of multipliers (ADMM) for distributed optimization.
The results of our numerical experiments indicate that our method is faster and more robust than widely-used approaches.
arXiv Detail & Related papers (2023-08-31T14:16:30Z) - Kernel-based off-policy estimation without overlap: Instance optimality
beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variance-reduced variants, namely Variance Reduced Langevin Dynamics and Recursive Gradient Langevin Dynamics.
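For context, a minimal sketch of the generic (non-variance-reduced) SGLD update: a gradient step plus Gaussian noise scaled by $\sqrt{2\eta}$, shown here sampling from a 1-D standard Gaussian target with $U(\theta) = \theta^2/2$. The variance-reduced variants above modify how the gradient estimate is formed, which is not reproduced here.

```python
import numpy as np

def sgld_step(theta, grad_U, step, rng):
    """One SGLD update: gradient step plus injected noise of scale sqrt(2*step)."""
    noise = rng.normal(size=theta.shape)
    return theta - step * grad_U(theta) + np.sqrt(2.0 * step) * noise

def grad_U(t):
    return t  # U(theta) = theta^2 / 2, so the target is a standard Gaussian

rng = np.random.default_rng(0)
theta = np.array([5.0])          # deliberately bad initialization
samples = []
for i in range(100_000):
    theta = sgld_step(theta, grad_U, 1e-2, rng)
    if i >= 10_000:              # discard burn-in
        samples.append(theta[0])

print(np.mean(samples), np.var(samples))  # close to the target's (0, 1)
```

With a full-batch gradient this is plain Langevin dynamics; in the stochastic setting `grad_U` would be a mini-batch estimate, and its variance is what the variance-reduced variants attack.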
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Sinkhorn Distributionally Robust Optimization [18.46110328123008]
Sinkhorn distance is a variant of the Wasserstein distance based on entropic regularization. We derive a convex programming dual reformulation for general nominal distributions, transport costs, and loss functions.
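The entropic regularization underlying the Sinkhorn distance makes discrete optimal transport solvable by simple alternating scaling. A textbook sketch of the Sinkhorn iterations between two histograms (the paper's DRO dual reformulation is not shown):

```python
import numpy as np

def sinkhorn(a, b, C, eps, iters=500):
    """Sinkhorn iterations for entropy-regularized OT between histograms a and b."""
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)             # scale columns to match marginal b
        u = a / (K @ v)               # scale rows to match marginal a
    return u[:, None] * K * v[None, :]  # transport plan P = diag(u) K diag(v)

n = 5
a = np.full(n, 1.0 / n)                              # uniform source histogram
b = np.full(n, 1.0 / n)                              # uniform target histogram
C = (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2.0  # squared cost
P = sinkhorn(a, b, C, eps=1.0)
print(P.sum(axis=1), P.sum(axis=0))  # both approach the marginals a and b
```

Smaller `eps` sharpens the plan toward the unregularized optimum but slows convergence and risks underflow in the kernel, which is the usual motivation for log-domain implementations.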
arXiv Detail & Related papers (2021-09-24T12:40:48Z) - Scalable Variational Gaussian Processes via Harmonic Kernel
Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability.
We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections.
Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z) - SGB: Stochastic Gradient Bound Method for Optimizing Partition Functions [15.33098084159285]
This paper addresses the problem of optimizing partition functions in a learning setting.
We propose a variant of the bound majorization algorithm that relies on upper-bounding the partition function with a quadratic surrogate.
arXiv Detail & Related papers (2020-11-03T04:42:51Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, albeit only up to an error neighborhood.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
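The trade-off driving the adaptive batch size is that the variance of a mini-batch gradient estimate shrinks roughly like $1/B$. A toy numerical check under illustrative assumptions (a quadratic loss on Gaussian data, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100_000)

def grad_estimate(theta, batch):
    # Gradient of the loss 0.5*(theta - x)^2, averaged over a mini-batch.
    return np.mean(theta - batch)

theta = 0.0
variances = {}
for B in (1, 10, 100):
    ests = [grad_estimate(theta, rng.choice(data, size=B)) for _ in range(2_000)]
    variances[B] = np.var(ests)

print(variances)  # variance drops roughly like 1/B
```

Larger batches buy a lower-variance (hence more "constant-step-friendly") gradient at a higher per-iteration cost, which is exactly the quantity an adaptive schedule can trade against the step size.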
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.