Related papers: SGD with Clipping is Secretly Estimating the Median Gradient

SGD with Clipping is Secretly Estimating the Median Gradient

URL: http://arxiv.org/abs/2402.12828v1
Date: Tue, 20 Feb 2024 08:54:07 GMT
Title: SGD with Clipping is Secretly Estimating the Median Gradient
Authors: Fabian Schaipp, Guillaume Garrigos, Umut Simsekli, Robert Gower
Abstract summary: We study distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed state-dependent noise. We propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.
Score: 19.69067856415625
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise. We then derive iterative methods based on the stochastic proximal point method for computing the geometric median and generalizations thereof. Finally we propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.

Related papers

Local Averaging Accurately Distills Manifold Structure From Noisy Data [4.63748375343038]
Local averaging is a cornerstone of state-of-the-art provable methods for manifold fitting and denoising.<n>We provide theoretical analyses of a two-round mini-batch local averaging method applied to noisy samples drawn from a $d$-dimensional manifold.<n>The proposed method can serve as a preprocessing step for a wide range of provable methods designed for lower-noise regimes.
arXiv Detail & Related papers (2025-06-23T15:32:16Z)
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches.<n>We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
arXiv Detail & Related papers (2025-06-02T17:34:29Z)
Survey on Algorithms for multi-index models [45.143425167349314]
We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity.
arXiv Detail & Related papers (2025-04-07T18:50:11Z)
Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization [0.5120567378386615]
First-order descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning. However, these methods do not exploit curvature information. Quasi-Newton methods re-use previously computed low Hessian approximations.
arXiv Detail & Related papers (2025-02-17T20:20:11Z)
A Historical Trajectory Assisted Optimization Method for Zeroth-Order Federated Learning [24.111048817721592]
Federated learning heavily relies on distributed gradient descent techniques. In the situation where gradient information is not available, gradients need to be estimated from zeroth-order information. We propose a non-isotropic sampling method to improve the gradient estimation procedure.
arXiv Detail & Related papers (2024-09-24T10:36:40Z)
A quasi-Bayesian sequential approach to deconvolution density estimation [7.10052009802944]
Density deconvolution addresses the estimation of the unknown density function $f$ of a random signal from data. We consider the problem of density deconvolution in a streaming or online setting where noisy data arrive progressively. By relying on a quasi-Bayesian sequential approach, we obtain estimates of $f$ that are of easy evaluation.
arXiv Detail & Related papers (2024-08-26T16:40:04Z)
Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients [0.8749675983608172]
We present an unbiased method for posterior means based on kinetic Langevin dynamics. Our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem. Our results demonstrate that in large-scale applications, the unbiased algorithm we present can be 2-3 orders of magnitude more efficient than the gold-standard" randomized Hamiltonian Monte Carlo.
arXiv Detail & Related papers (2023-11-08T21:19:52Z)
Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors. Our method efficiently conducts global gradient approximation while achieving better accuracy and ability generalization of local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z)
Convergence of Batch Stochastic Gradient Descent Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results [0.9900482274337404]
We study convex optimization using a very general formulation called BSGD (Block Gradient Descent) We establish conditions for BSGD to converge to the global minimum, based on approximation theory. We show that when approximate gradients are used, BSGD converges while momentum-based methods can diverge.
arXiv Detail & Related papers (2022-09-12T16:23:15Z)
Posterior and Computational Uncertainty in Gaussian Processes [52.26904059556759]
Gaussian processes scale prohibitively with the size of the dataset. Many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. We develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended.
arXiv Detail & Related papers (2022-05-30T22:16:25Z)
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective. We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank. Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
Pathwise Conditioning of Gaussian Processes [72.61885354624604]
Conventional approaches for simulating Gaussian process posteriors view samples as draws from marginal distributions of process values at finite sets of input locations. This distribution-centric characterization leads to generative strategies that scale cubically in the size of the desired random vector. We show how this pathwise interpretation of conditioning gives rise to a general family of approximations that lend themselves to efficiently sampling Gaussian process posteriors.
arXiv Detail & Related papers (2020-11-08T17:09:37Z)
Carath\'eodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carath'eodory about measure reduction. We adaptively select the descent steps where the measure reduction is carried out. We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms [39.746670539407084]
We consider the problem of sampling from a strongly log-concave density in $bbRd$. We prove an information theoretic lower bound on the number of gradient queries of the log density needed.
arXiv Detail & Related papers (2020-02-01T23:46:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.