SGD with Clipping is Secretly Estimating the Median Gradient
- URL: http://arxiv.org/abs/2402.12828v1
- Date: Tue, 20 Feb 2024 08:54:07 GMT
- Title: SGD with Clipping is Secretly Estimating the Median Gradient
- Authors: Fabian Schaipp, Guillaume Garrigos, Umut Simsekli, Robert Gower
- Abstract summary: Robust gradient estimates benefit stochastic optimization in settings such as distributed learning with corrupted nodes, large outliers in the training data, learning under privacy constraints, or heavy-tailed noise due to the dynamics of the algorithm itself.
We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed state-dependent noise.
We propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.
- Score: 19.69067856415625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are several applications of stochastic optimization where one can
benefit from a robust estimate of the gradient. For example, domains such as
distributed learning with corrupted nodes, the presence of large outliers in
the training data, learning under privacy constraints, or even heavy-tailed
noise due to the dynamics of the algorithm itself. Here we study SGD with
robust gradient estimators based on estimating the median. We first consider
computing the median gradient across samples, and show that the resulting
method can converge even under heavy-tailed, state-dependent noise. We then
derive iterative methods based on the stochastic proximal point method for
computing the geometric median and generalizations thereof. Finally we propose
an algorithm estimating the median gradient across iterations, and find that
several well known methods - in particular different forms of clipping - are
particular cases of this framework.
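
To make the link between clipping and median estimation concrete, here is a minimal sketch, assuming the median estimate `m` is updated by one stochastic proximal point step on the objective E||m - g|| per sampled gradient `g`. That step has the closed form m + clip_tau(g - m), so with m = 0 it reduces to ordinary gradient clipping. The function names, the step size `tau`, and the Cauchy test data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def clip(v, tau):
    """Rescale v so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(v)
    return v if norm <= tau else v * (tau / norm)


def prox_median_step(m, g, tau):
    """One stochastic proximal point step on E||m - g||.

    Closed-form solution of argmin_x ||x - g|| + ||x - m||^2 / (2 * tau):
    move from m toward g, but by at most tau.
    """
    return m + clip(g - m, tau)


def estimate_median(samples, tau=0.01, m0=None):
    """Run the step over a stream of vectors (hypothetical helper)."""
    m = np.zeros(samples.shape[1]) if m0 is None else np.asarray(m0, dtype=float)
    for g in samples:
        m = prox_median_step(m, g, tau)
    return m


if __name__ == "__main__":
    # Heavy-tailed (Cauchy) noise around an assumed "true gradient".
    rng = np.random.default_rng(0)
    center = np.array([1.0, -2.0])
    samples = center + rng.standard_cauchy((5000, 2))
    print("running median estimate:", estimate_median(samples))
    print("sample mean:            ", samples.mean(axis=0))
```

With a constant `tau` the estimate keeps fluctuating around the median at a scale that shrinks with `tau`; a smaller step gives a steadier estimate at the cost of slower tracking.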
Related papers
- Towards Provable Log Density Policy Gradient [6.0891236991406945]
Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning.
In this work, we argue that this residual term is significant and correcting for it could potentially improve sample-complexity of reinforcement learning methods.
We propose log density gradient to estimate the policy gradient, which corrects for this residual error term.
arXiv Detail & Related papers (2024-03-03T20:09:09Z)
- Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability in local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z)
- Preferential Subsampling for Stochastic Gradient Langevin Dynamics [3.158346511479111]
Stochastic gradient MCMC offers an unbiased estimate of the gradient of the log-posterior with a small, uniformly-weighted subsample of the data.
The resulting gradient estimator may exhibit a high variance and impact sampler performance.
We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.
arXiv Detail & Related papers (2022-10-28T14:56:18Z)
- Convergence of Batch Stochastic Gradient Descent Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results [0.9900482274337404]
We study convex optimization using a very general formulation called BSGD (Block Stochastic Gradient Descent).
We establish conditions for BSGD to converge to the global minimum, based on approximation theory.
We show that when approximate gradients are used, BSGD converges while momentum-based methods can diverge.
arXiv Detail & Related papers (2022-09-12T16:23:15Z)
- Posterior and Computational Uncertainty in Gaussian Processes [52.26904059556759]
Gaussian processes scale prohibitively with the size of the dataset.
Many approximation methods have been developed, which inevitably introduce approximation error.
This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior.
We develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended.
arXiv Detail & Related papers (2022-05-30T22:16:25Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Pathwise Conditioning of Gaussian Processes [72.61885354624604]
Conventional approaches for simulating Gaussian process posteriors view samples as draws from marginal distributions of process values at finite sets of input locations.
This distribution-centric characterization leads to generative strategies that scale cubically in the size of the desired random vector.
We show how this pathwise interpretation of conditioning gives rise to a general family of approximations that lend themselves to efficiently sampling Gaussian process posteriors.
arXiv Detail & Related papers (2020-11-08T17:09:37Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Nearest Neighbor Dirichlet Mixtures [3.3194866396158]
We propose a class of nearest neighbor-Dirichlet mixtures to maintain most of the strengths of Bayesian approaches without the computational disadvantages.
A simple and embarrassingly parallel Monte Carlo algorithm is proposed to sample from the resulting pseudo-posterior for the unknown density.
arXiv Detail & Related papers (2020-03-17T21:39:11Z)
- Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms [39.746670539407084]
We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$.
We prove an information theoretic lower bound on the number of gradient queries of the log density needed.
arXiv Detail & Related papers (2020-02-01T23:46:35Z)