Convergence of the mini-batch SIHT algorithm
- URL: http://arxiv.org/abs/2209.14536v1
- Date: Thu, 29 Sep 2022 03:47:46 GMT
- Title: Convergence of the mini-batch SIHT algorithm
- Authors: Saeed Damadi, Jinglai Shen
- Abstract summary: The Iterative Hard Thresholding (IHT) algorithm has been studied extensively as an effective deterministic algorithm for solving sparse optimization problems.
We show that the sequence generated by the stochastic mini-batch SIHT is a supermartingale and converges with probability one.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Iterative Hard Thresholding (IHT) algorithm has been studied
extensively as an effective deterministic algorithm for solving sparse
optimization problems. The IHT algorithm benefits from the information of the
batch (full) gradient at each point, and this information is crucial for the
convergence analysis of the generated sequence. However, this strength becomes
a weakness in machine learning and high-dimensional statistical applications,
because calculating the batch gradient at each iteration is computationally
expensive or impractical. Fortunately, in these applications the objective
function has a summation structure that can be exploited to approximate the
batch gradient by a stochastic mini-batch gradient. In this paper, we study the
mini-batch Stochastic IHT (SIHT) algorithm for solving sparse optimization
problems. As opposed to previous works, where an increasing and variable
mini-batch size is necessary for the derivation, we fix the mini-batch size
according to a lower bound that we derive. To prove stochastic convergence of
the objective value function, we first establish a critical sparse stochastic
gradient descent property. Using this property, we show that the sequence
generated by the stochastic mini-batch SIHT is a supermartingale and converges
with probability one. Unlike previous work, we do not assume the function to be
restricted strongly convex. To the best of our knowledge, in the regime of
sparse optimization, this is the first time in the literature that the sequence
of stochastic function values is shown to converge with probability one with a
mini-batch size that is fixed for all steps.
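The iteration analyzed here is the classical IHT step with the full gradient replaced by a mini-batch estimate: a gradient step followed by hard thresholding onto the set of $s$-sparse vectors, with the mini-batch size held fixed across iterations. The sketch below is a minimal illustration of that update for a least-squares objective $f(x) = \frac{1}{2n}\|Ax - b\|^2$; the names (`hard_threshold`, `minibatch_siht`), the step size, and the batch size are illustrative choices, not the constants prescribed by the paper or its derived lower bound on the mini-batch size.
```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    z = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -s)[-s:]
    z[idx] = x[idx]
    return z

def minibatch_siht(A, b, s, step=1e-2, batch=32, iters=500, rng=None):
    """Mini-batch SIHT sketch for f(x) = (1/2n)||Ax - b||^2.

    Each iteration replaces the batch gradient A^T(Ax - b)/n by a
    mini-batch estimate, takes a gradient step, and hard-thresholds
    the result back onto the set of s-sparse vectors.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=batch, replace=False)   # sample a mini-batch
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch     # mini-batch gradient
        x = hard_threshold(x - step * g, s)              # gradient step + H_s
    return x

# Tiny usage example on synthetic sparse-recovery data.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
x_true = np.zeros(50); x_true[:5] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(200)
x_hat = minibatch_siht(A, b, s=5, rng=1)
```
Keeping `batch` fixed for all iterations mirrors the setting of the paper; roughly, the analysis shows that the resulting sequence of function values satisfies a supermartingale-type inequality, $\mathbb{E}[f(x^{k+1}) \mid x^0,\dots,x^k] \le f(x^k)$, which is what yields convergence with probability one without restricted strong convexity.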
Related papers
- Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Relationship between Batch Size and Number of Steps Needed for Nonconvex
Optimization of Stochastic Gradient Descent using Armijo Line Search [0.8158530638728501]
We show that SGD performs better than other deep learning optimizers when it uses the Armijo line search.
The results indicate that the number of SFO (stochastic first-order oracle) steps needed can be estimated as the batch size grows.
arXiv Detail & Related papers (2023-07-25T21:59:17Z) - A Gradient Smoothed Functional Algorithm with Truncated Cauchy Random
Perturbations for Stochastic Optimization [10.820943271350442]
We present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples.
We also show that our algorithm avoids unstable equilibria, implying convergence to local minima.
arXiv Detail & Related papers (2022-07-30T18:50:36Z) - STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization [74.1615979057429]
We investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
arXiv Detail & Related papers (2021-11-01T15:43:36Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the mini-batch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z) - Stochastic Proximal Gradient Algorithm with Minibatches. Application to
Large Scale Learning Models [2.384873896423002]
We develop and analyze minibatch variants of the stochastic proximal gradient algorithm for general composite objective functions with nonsmooth components.
We provide complexity bounds for constant and variable stepsize iteration policies, obtaining that, for minibatch size $N$, $\epsilon$-suboptimality in expected quadratic distance to the optimal solution is attained after $\mathcal{O}(\frac{1}{N\epsilon})$ iterations.
arXiv Detail & Related papers (2020-03-30T10:43:56Z)