Stochastic Reweighted Gradient Descent
- URL: http://arxiv.org/abs/2103.12293v1
- Date: Tue, 23 Mar 2021 04:09:43 GMT
- Title: Stochastic Reweighted Gradient Descent
- Authors: Ayoub El Hanchi, David A. Stephens
- Abstract summary: We propose an importance-sampling-based algorithm we call SRG (stochastic reweighted gradient)
We pay particular attention to the time and memory overhead of our proposed method.
We present empirical results to support our findings.
- Score: 4.355567556995855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the strong theoretical guarantees that variance-reduced finite-sum
optimization algorithms enjoy, their applicability remains limited to cases
where the memory overhead they introduce (SAG/SAGA), or the periodic full
gradient computation they require (SVRG/SARAH) are manageable. A promising
approach to achieving variance reduction while avoiding these drawbacks is the
use of importance sampling instead of control variates. While many such methods
have been proposed in the literature, directly proving that they improve the
convergence of the resulting optimization algorithm has remained elusive. In
this work, we propose an importance-sampling-based algorithm we call SRG
(stochastic reweighted gradient). We analyze the convergence of SRG in the
strongly-convex case and show that, while it does not recover the linear rate
of control variates methods, it provably outperforms SGD. We pay particular
attention to the time and memory overhead of our proposed method, and design a
specialized red-black tree allowing its efficient implementation. Finally, we
present empirical results to support our findings.
Related papers
- Sample-efficient Bayesian Optimisation Using Known Invariances [56.34916328814857]
We show that vanilla and constrained BO algorithms are inefficient when optimising invariant objectives.
We derive a bound on the maximum information gain of these invariant kernels.
We use our method to design a current drive system for a nuclear fusion reactor, finding a high-performance solution.
arXiv Detail & Related papers (2024-10-22T12:51:46Z) - Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates Gradient Normalization without (NSGDC) its gradient reduction variant (NSGDC-VR)
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quant clipping strategy for Gradient Descent (SGD)
We use gradient new outliers as norm clipping chains.
We propose an implementation of the algorithm using Huberiles.
arXiv Detail & Related papers (2023-09-29T15:24:48Z) - Efficiency Ordering of Stochastic Gradient Descent [9.634481296779057]
We consider the gradient descent (SGD) algorithm driven by a general sampling sequence, including i.i.i.d noise and random walk on an arbitrary graph.
We employ the notion of efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo samplers.
arXiv Detail & Related papers (2022-09-15T16:50:55Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
convex descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade.
Some modified SGD-type algorithms, which outperform the SGD in many competitions and applications in terms of convergence rate and accuracy, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad)
We focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-possibly non-possibly non-possibly) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Adaptive Importance Sampling for Finite-Sum Optimization and Sampling
with Decreasing Step-Sizes [4.355567556995855]
We propose Avare, a simple and efficient algorithm for adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes.
Under standard technical conditions, we show that Avare achieves $mathcalO(T2/3)$ and $mathcalO(T5/6)$ dynamic regret for SGD and SGLD respectively when run with $mathcalO(T5/6)$ step sizes.
arXiv Detail & Related papers (2021-03-23T00:28:15Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Stochastic Approximate Gradient Descent via the Langevin Algorithm [11.36635610546803]
We introduce the approximate gradient descent (SAGD) as an alternative to the gradient descent for cases where unbiased gradients cannot be trivially obtained.
We show that SAGD performs well experimentally in popular statistical and machine learning problems such as the expectation-maximization algorithm and the variational autoencoders.
arXiv Detail & Related papers (2020-02-13T14:29:21Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: The random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.