Variance reduction techniques for stochastic proximal point algorithms
- URL: http://arxiv.org/abs/2308.09310v3
- Date: Tue, 6 Aug 2024 10:38:22 GMT
- Title: Variance reduction techniques for stochastic proximal point algorithms
- Authors: Cheik Traoré, Vassilis Apidopoulos, Saverio Salzo, Silvia Villa,
- Abstract summary: In this work, we propose the first unified study of variance reduction techniques for proximal point algorithms.
We introduce a generic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants.
Our experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts.
- Score: 5.374800961359305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of finite sums minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods. Their practical impact is clear, as well as their theoretical properties. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the step size. However, their variance-reduced versions are not as well studied as the gradient ones. In this work, we propose the first unified study of variance reduction techniques for stochastic proximal point algorithms. We introduce a generic stochastic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants. For this algorithm, in the smooth setting, we provide several convergence rates for the iterates and the objective function values, which are faster than those of the vanilla stochastic proximal point algorithm. More specifically, for convex functions, we prove a sublinear convergence rate of $O(1/k)$. In addition, under the Polyak-{\L}ojasiewicz (PL) condition, we obtain linear convergence rates. Finally, our numerical experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts in terms of the stability with respect to the choice of the step size in most cases, especially for difficult problems.
Related papers
- Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods [0.3222802562733786]
We consider minimizing finite-sum expectation objective functions via Hessian-averaging based subsampled Newton methods.
These methods allow for inexactness and have fixed per-it Hessian approximation costs.
We present novel analysis techniques and propose challenges for their practical implementation.
arXiv Detail & Related papers (2024-08-14T03:27:48Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - High-Probability Bounds for Stochastic Optimization and Variational
Inequalities: the Case of Unbounded Variance [59.211456992422136]
We propose algorithms with high-probability convergence results under less restrictive assumptions.
These results justify the usage of the considered methods for solving problems that do not fit standard functional classes in optimization.
arXiv Detail & Related papers (2023-02-02T10:37:23Z) - Reinforcement Learning with Unbiased Policy Evaluation and Linear
Function Approximation [11.345796608258434]
We provide performance guarantees for a variant of simulation-based policy iteration for controlling Markov decision processes.
We analyze two algorithms; the first algorithm involves a least squares approach where a new set of weights associated with feature vectors is obtained via at least squares at each iteration.
The second algorithm involves a two-time-scale approximation algorithm taking several steps of gradient descent towards the least squares solution.
arXiv Detail & Related papers (2022-10-13T20:16:19Z) - Exponential Concentration in Stochastic Approximation [0.8192907805418583]
We analyze the behavior of approximation algorithms where iterates, in expectation, progress towards an objective at each step.
We apply our results to several different Markov Approximation algorithms, specifically Projected Gradient Descent, Kiefer-Wolfowitz and Frank-Wolfe algorithms.
arXiv Detail & Related papers (2022-08-15T14:57:26Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Multi-kernel Passive Stochastic Gradient Algorithms and Transfer
Learning [21.796874356469644]
The gradient algorithm does not have control over the location where noisy gradients of the cost function are evaluated.
The algorithm performs substantially better in high dimensional problems and incorporates variance reduction.
arXiv Detail & Related papers (2020-08-23T11:55:19Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z) - Proximal Gradient Temporal Difference Learning: Stable Reinforcement
Learning with Polynomial Sample Complexity [40.73281056650241]
We introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true gradient temporal difference learning algorithms.
We show how gradient TD reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function.
arXiv Detail & Related papers (2020-06-06T21:04:21Z) - Optimal Randomized First-Order Methods for Least-Squares Problems [56.05635751529922]
This class of algorithms encompasses several randomized methods among the fastest solvers for least-squares problems.
We focus on two classical embeddings, namely, Gaussian projections and subsampled Hadamard transforms.
Our resulting algorithm yields the best complexity known for solving least-squares problems with no condition number dependence.
arXiv Detail & Related papers (2020-02-21T17:45:32Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.