Convergence of Batch Stochastic Gradient Descent Methods with
Approximate Gradients and/or Noisy Measurements: Theory and Computational
Results
- URL: http://arxiv.org/abs/2209.05372v1
- Date: Mon, 12 Sep 2022 16:23:15 GMT
- Title: Convergence of Batch Stochastic Gradient Descent Methods with
Approximate Gradients and/or Noisy Measurements: Theory and Computational
Results
- Authors: Rajeeva L. Karandikar, Tadipatri Uday Kiran Reddy and M. Vidyasagar
- Abstract summary: We study convex optimization using a very general formulation called BSGD (Block Stochastic Gradient Descent).
We establish conditions for BSGD to converge to the global minimum, based on stochastic approximation theory.
We show that when approximate gradients are used, BSGD converges while momentum-based methods can diverge.
- Score: 0.9900482274337404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study convex optimization using a very general formulation
called BSGD (Block Stochastic Gradient Descent). At each iteration, some but
not necessarily all components of the argument are updated. The direction of the
update can be one of two possibilities: (i) A noise-corrupted measurement of
the true gradient, or (ii) an approximate gradient computed using a first-order
approximation, using function values that might themselves be corrupted by
noise. This formulation embraces most of the currently used stochastic gradient
methods. We establish conditions for BSGD to converge to the global minimum,
based on stochastic approximation theory. Then we verify the predicted
convergence through numerical experiments. Our results show that when
approximate gradients are used, BSGD converges while momentum-based methods can
diverge. However, not just our BSGD but also standard (full-update) gradient
descent and various momentum-based methods all converge, even with noisy
gradients.
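A minimal sketch of the BSGD scheme as the abstract describes it, assuming a simple quadratic objective: at each iteration a random block of coordinates is updated, using either (i) a noise-corrupted gradient or (ii) a finite-difference approximation built from noisy function values. The block size, step schedule, and noise model below are illustrative assumptions, not the paper's exact setup.

```python
# BSGD toy sketch: partial (block) coordinate updates with two gradient options.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 0.01

def f(x):
    return 0.5 * np.dot(x, x)            # convex test objective, minimum at 0

def noisy_grad(x):                        # option (i): noisy true gradient
    return x + sigma * rng.standard_normal(d)

def approx_grad_coord(x, i, h=0.1):       # option (ii): central difference
    e = np.zeros(d); e[i] = h             # built from two noisy function values
    f_plus = f(x + e) + sigma * rng.standard_normal()
    f_minus = f(x - e) + sigma * rng.standard_normal()
    return (f_plus - f_minus) / (2 * h)

x = rng.standard_normal(d)
for t in range(1, 5001):
    step = 1.0 / t                                   # diminishing step size
    block = rng.choice(d, size=3, replace=False)     # update only this block
    g = noisy_grad(x)
    for i in block:
        direction = approx_grad_coord(x, i) if t % 2 else g[i]
        x[i] -= step * direction
print("final ||x|| =", np.linalg.norm(x))            # shrinks toward 0
```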
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
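As context for the compression idea, here is a minimal sketch of one-bit (sign) gradient compression with a single scale factor. FO-SGD's actual "flattening" uses a random rotation and dithered quantization, so everything below is an illustrative assumption, not the authors' algorithm.

```python
# One-bit gradient compression, toy version: one sign bit per coordinate
# plus a single float scale, applied in a worker -> server loop.
import numpy as np

def compress(g):
    return np.mean(np.abs(g)), np.sign(g)    # ~1 bit/coordinate + one float

def decompress(scale, signs):
    return scale * signs                      # exact only for "flat" gradients

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
for _ in range(300):
    g = x                                     # gradient of 0.5 * ||x||^2
    scale, signs = compress(g)                # worker -> parameter server
    x -= 0.05 * decompress(scale, signs)      # server applies the update
print("||x|| =", np.linalg.norm(x))           # shrinks toward 0
```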
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - SGD with Clipping is Secretly Estimating the Median Gradient [19.69067856415625]
We consider settings such as distributed learning with corrupted nodes, large outliers in the training data, learning under privacy constraints, and heavy-tailed noise due to the dynamics of the algorithm itself.
We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed state-dependent noise.
We propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.
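A toy illustration of the paper's observation, with assumed constants: iterating a clipped correction over a stream of heavy-tailed gradient samples tracks the median, where a running mean fails.

```python
# Iterating  m <- m + eta * clip(g - m, tau)  over heavy-tailed samples g
# behaves like an online median estimator, while the sample mean of Cauchy
# noise does not even converge. eta and tau are illustrative.
import numpy as np

def clip(v, tau):
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

rng = np.random.default_rng(2)
samples = rng.standard_cauchy((20000, 2))     # heavy-tailed "gradients"
m = np.zeros(2)
for g in samples:
    m = m + 0.01 * clip(g - m, tau=1.0)
print("clipped-update estimate:", m)                      # near the median (0, 0)
print("running sample mean:    ", samples.mean(axis=0))   # unstable for Cauchy
```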
arXiv Detail & Related papers (2024-02-20T08:54:07Z) - Neural Gradient Learning and Optimization for Oriented Point Normal
Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description.
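The method above is learned end-to-end; as background only, the snippet below shows the classical PCA plane-fit that produces the kind of coarse, orientation-ambiguous normals such pipelines start from. It is an illustrative baseline, not the paper's network.

```python
# Classical local-plane (PCA) normal estimation: the direction of smallest
# variance in a neighborhood is the plane normal. Note the sign ambiguity;
# consistent *orientation* is exactly what learned methods target.
import numpy as np

def coarse_normal(neighbors):
    centered = neighbors - neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]                                # smallest principal direction

rng = np.random.default_rng(3)
patch = np.c_[rng.uniform(-1, 1, (50, 2)),       # nearly flat patch in x-y
              0.01 * rng.standard_normal(50)]    # small z-noise
print(coarse_normal(patch))                      # close to (0, 0, +/-1)
```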
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - Convergence of First-Order Methods for Constrained Nonconvex
Optimization with Dependent Data [7.513100214864646]
We show a worst-case convergence rate of $\tilde{O}(t^{-1/4})$ and complexity of $\tilde{O}(\varepsilon^{-4})$, measured via the Moreau envelope, for smooth nonconvex optimization.
We obtain the first online nonnegative matrix factorization algorithms for dependent data, based on projected gradient methods with adaptive step sizes and optimal convergence.
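A hedged sketch of the projected-gradient ingredient for nonnegative matrix factorization: a gradient step on each factor followed by projection onto the nonnegative orthant. The fixed step size below is an assumption; the paper's contribution includes adaptive step sizes and dependent data.

```python
# Projected gradient descent for min ||X - WH||_F^2 with W, H >= 0:
# take a gradient step, then clip negatives to zero (the projection).
import numpy as np

rng = np.random.default_rng(4)
X = np.abs(rng.standard_normal((30, 20)))       # nonnegative data matrix
W = np.abs(rng.standard_normal((30, 5)))
H = np.abs(rng.standard_normal((5, 20)))
eta = 1e-3                                      # fixed step (illustrative)
for _ in range(500):
    R = W @ H - X                               # residual
    gW, gH = R @ H.T, W.T @ R                   # gradients of the two factors
    W = np.maximum(W - eta * gW, 0.0)           # step + project onto W >= 0
    H = np.maximum(H - eta * gH, 0.0)           # step + project onto H >= 0
print("relative fit error:", np.linalg.norm(W @ H - X) / np.linalg.norm(X))
```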
arXiv Detail & Related papers (2022-03-29T17:59:10Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
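A 1-D sketch in the spirit of this idea, with assumed constants: replace the Metropolis-Hastings accept/reject step (the non-differentiable part) with unadjusted Langevin transitions and accumulate the usual AIS log-weights. This follows the blurb's description, not the paper's exact algorithm.

```python
# AIS from N(0, 1) to an unnormalized shifted Gaussian, with unadjusted
# Langevin (ULA) transitions instead of Metropolis-Hastings. Dropping the
# accept/reject step makes every operation differentiable, at the cost of a
# small bias from the inexact transitions.
import numpy as np

rng = np.random.default_rng(5)

def log_p0(x): return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # normalized
def log_p1(x): return -0.5 * (x - 2.0) ** 2                  # unnormalized

K, n, eps = 100, 5000, 0.05
betas = np.linspace(0.0, 1.0, K + 1)
x = rng.standard_normal(n)
logw = np.zeros(n)
for k in range(1, K + 1):
    b = betas[k]
    logw += (b - betas[k - 1]) * (log_p1(x) - log_p0(x))     # AIS increment
    grad = (1 - b) * (-x) + b * (-(x - 2.0))                 # grad log density
    x = x + eps * grad + np.sqrt(2 * eps) * rng.standard_normal(n)  # ULA step
print("estimated log Z:", np.log(np.mean(np.exp(logw))))
print("exact log Z:    ", 0.5 * np.log(2 * np.pi))
```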
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider nonconvex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
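A sketch of the update pattern this result concerns, with illustrative hyperparameters: clipped stochastic gradients feed a momentum buffer, and the descent step is normalized to unit length.

```python
# Clipped gradients + momentum + normalized steps under heavy-tailed noise.
import numpy as np

def clip(g, tau=1.0):
    n = np.linalg.norm(g)
    return g if n <= tau else g * (tau / n)

rng = np.random.default_rng(6)
x, m = rng.standard_normal(20), np.zeros(20)
for t in range(1, 2001):
    g = x + rng.standard_cauchy(20)              # heavy-tailed gradient noise
    m = 0.9 * m + 0.1 * clip(g)                  # momentum over clipped grads
    x -= (0.5 / np.sqrt(t)) * m / (np.linalg.norm(m) + 1e-12)  # normalized step
print("||x|| =", np.linalg.norm(x))              # well below the initial value
```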
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of stochastic variance reduced gradient (SVRG) methods.
A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
VCSG needs $\mathcal{O}(\min\{1/\epsilon^{3/2}, n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity.
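A sketch of the SVRG control variate that VCSG builds on: a full gradient at a snapshot point plus a per-sample correction. The $\lambda$-blend below mimics the "variance control" idea in spirit; its value and the least-squares objective are illustrative assumptions.

```python
# SVRG-style variance reduction with a lambda blend, on least squares.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 10))
b = A @ rng.standard_normal(10)         # consistent system: optimum residual 0

def grad_i(x, i):                        # gradient of 0.5 * (a_i . x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x, lam = np.zeros(10), 0.5               # lam blends SVRG and plain SGD grads
for epoch in range(30):
    snap = x.copy()
    full = A.T @ (A @ snap - b) / len(b)             # full gradient at snapshot
    for _ in range(100):
        i = rng.integers(len(b))
        g_svrg = grad_i(x, i) - grad_i(snap, i) + full   # control variate
        g = lam * g_svrg + (1 - lam) * grad_i(x, i)      # variance-controlled
        x -= 0.01 * g
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```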
arXiv Detail & Related papers (2021-02-19T12:22:56Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
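A sketch of the two ingredients named above: a zeroth-order coordinate gradient estimate from function queries only, with coordinates drawn from a non-uniform importance distribution. The curvature-proportional sampling weights are an illustrative assumption, not the paper's design.

```python
# ZO coordinate descent: two function queries per step, importance-sampled.
import numpy as np

rng = np.random.default_rng(8)
w = np.array([4.0, 1.0, 1.0, 1.0])           # ill-conditioned quadratic
f = lambda x: 0.5 * np.sum(w * x**2)

x, mu = rng.standard_normal(4), 1e-4
probs = w / w.sum()                          # sample important coords more
for t in range(4000):
    i = rng.choice(4, p=probs)
    e = np.zeros(4); e[i] = mu
    gi = (f(x + e) - f(x - e)) / (2 * mu)    # central difference, 2 queries
    x[i] -= 0.05 * gi / (len(x) * probs[i])  # reweight so sampling is unbiased
print("f(x) =", f(x))                        # near 0; with p_i ~ curvature,
                                             # all coordinates contract alike
```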
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
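The root-seeking flavor can be illustrated, under the assumption that a Polyak/Newton-style step is representative (the paper's norm-adapted rule may differ), by the update $x \leftarrow x - f(x)\,\nabla f(x)/\|\nabla f(x)\|^2$, which drives $f(x)$ itself to zero.

```python
# Newton-like step that seeks a root of f (here the unit sphere), not a
# stationary point. Purely illustrative of "roots rather than stationary
# points"; not claimed to be the paper's algorithm.
import numpy as np

def f(x):    return np.sum(x**2) - 1.0        # root set: the unit sphere
def grad(x): return 2.0 * x

x = np.array([3.0, -4.0])
for _ in range(20):
    g = grad(x)
    x = x - f(x) * g / (g @ g)
print(f(x), np.linalg.norm(x))                # f -> 0, ||x|| -> 1
```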
arXiv Detail & Related papers (2020-10-09T20:22:29Z) - On the Convergence of SGD with Biased Gradients [28.400751656818215]
We analyze the convergence of biased stochastic gradient methods (SGD), where individual updates are corrupted by compression.
We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rate.
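A sketch of one bias source such analyses cover: top-$k$ gradient compression, which keeps only the $k$ largest-magnitude coordinates and therefore yields a biased gradient estimate. The value of $k$ and the objective are illustrative assumptions.

```python
# SGD with a biased (top-k) compressor on a simple quadratic.
import numpy as np

def top_k(g, k):
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out                                  # E[top_k(g)] != g: biased

rng = np.random.default_rng(9)
x = rng.standard_normal(50)
for _ in range(3000):
    g = x + 0.1 * rng.standard_normal(50)       # noisy gradient of 0.5*||x||^2
    x -= 0.1 * top_k(g, k=5)                    # biased but still contractive
print("||x|| =", np.linalg.norm(x))             # converges to a noise floor
```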
arXiv Detail & Related papers (2020-07-31T19:37:59Z) - Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
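A hedged sketch of the measure-reduction idea (see the snippet below): by Carathéodory's theorem, the mean of $n$ gradients in $\mathbb{R}^d$ can be matched exactly by a convex combination of at most $d+1$ of them. Here a nonnegative least-squares solve recovers such sparse weights; the authors' construction is different and exact by design.

```python
# Match the mean of n gradients with a sparse convex combination (<= d+1
# points, generically), via nonnegative least squares.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(10)
n, d = 200, 3
G = rng.standard_normal((n, d))             # n per-sample gradients in R^d
target = np.append(G.mean(axis=0), 1.0)     # match the mean and total mass 1
A = np.vstack([G.T, np.ones(n)])            # moment-matching constraints
w, _ = nnls(A, target)                      # nonnegative, typically sparse
print("nonzero weights:", np.count_nonzero(w > 1e-12))   # <= d + 1 generically
print("moment error:", np.linalg.norm(G.T @ w - G.mean(axis=0)))
```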
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.