Clip21: Error Feedback for Gradient Clipping
- URL: http://arxiv.org/abs/2305.18929v1
- Date: Tue, 30 May 2023 10:41:42 GMT
- Title: Clip21: Error Feedback for Gradient Clipping
- Authors: Sarit Khirirat, Eduard Gorbunov, Samuel Horváth, Rustem Islamov,
  Fakhri Karray, Peter Richtárik
- Abstract summary: We design Clip21, the first provably effective and practically useful error feedback mechanism for distributed methods with gradient clipping.
Our method converges faster in practice than competing methods.
- Score: 8.979288425347702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by the increasing popularity and importance of large-scale training
under differential privacy (DP) constraints, we study distributed gradient
methods with gradient clipping, i.e., clipping applied to the gradients
computed from local information at the nodes. While gradient clipping is an
essential tool for injecting formal DP guarantees into gradient-based methods
[1], it also induces bias which causes serious convergence issues specific to
the distributed setting. Inspired by recent progress in the error-feedback
literature which is focused on taming the bias/error introduced by
communication compression operators such as Top-$k$ [2], and mathematical
similarities between the clipping operator and contractive compression
operators, we design Clip21 -- the first provably effective and practically
useful error feedback mechanism for distributed methods with gradient clipping.
We prove that our method converges at the same
$\mathcal{O}\left(\frac{1}{K}\right)$ rate as distributed gradient descent in
the smooth nonconvex regime, which improves the previous best
$\mathcal{O}\left(\frac{1}{\sqrt{K}}\right)$ rate which was obtained under
significantly stronger assumptions. Our method converges significantly faster
in practice than competing methods.
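
The bias described in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: the quadratic local objectives, the values in `A`, the threshold `tau`, and the step size `gamma` are all hypothetical, and the error-feedback update is written in the EF21 style (each node clips the difference between its gradient and a locally stored shift), which is an assumption about the general form rather than a statement of the authors' exact algorithm.

```python
import numpy as np

def clip(v, tau):
    """Standard clipping operator: rescale v so that its norm is at most tau."""
    norm = np.linalg.norm(v)
    return v if norm <= tau else v * (tau / norm)

# Hypothetical local data: node i holds f_i(x) = 0.5 * ||x - a_i||^2,
# so the average objective is minimized at mean(a_i) = 0.
A = np.array([[-4.0], [2.0], [2.0]])

def local_grads(x):
    """Gradients of all local objectives at x, shape (n_nodes, dim)."""
    return x - A

def clipped_gd(x0, tau, gamma, steps):
    """Naive distributed GD: every node clips its own gradient before averaging."""
    x = x0.copy()
    for _ in range(steps):
        g = np.mean([clip(gi, tau) for gi in local_grads(x)], axis=0)
        x = x - gamma * g
    return x

def ef_clipped_gd(x0, tau, gamma, steps):
    """EF21-style sketch (assumed form): nodes clip the gradient *difference*."""
    x = x0.copy()
    v = np.zeros_like(A)  # per-node shifts, initialized to zero
    for _ in range(steps):
        x = x - gamma * v.mean(axis=0)  # server step with averaged shifts
        grads = local_grads(x)
        # each node moves its shift toward its fresh gradient by a clipped step
        v = v + np.array([clip(gi - vi, tau) for gi, vi in zip(grads, v)])
    return x

x0 = np.array([5.0])
x_naive = clipped_gd(x0, tau=1.0, gamma=0.1, steps=500)
x_ef = ef_clipped_gd(x0, tau=1.0, gamma=0.1, steps=500)
print("naive clipped GD:", x_naive, "| error-feedback variant:", x_ef)
```

In this toy setup the local gradients at the true minimizer x = 0 are 4, -2, -2; clipping them to norm 1 gives 1, -1, -1, which no longer average to zero, so the naive iteration settles near x = 1.5 instead. In the error-feedback variant the clipped differences eventually fall below the threshold, the shifts match the gradients exactly, and the iterates approach x = 0.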
 
      
        Related papers
        - Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees [13.806112971330482]
 We propose the first Greedy Low-Rank compression algorithm for distributed learning with rigorous convergence guarantees.
We prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam, marking the first linear speedup convergence rate for low-rank gradient compression.
 arXiv  Detail & Related papers  (2025-07-11T17:46:12Z)
- Clip Body and Tail Separately: High Probability Guarantees for DPSGD with Heavy Tails [20.432871178766927]
 Differentially Private Stochastic Gradient Descent (DPSGD) is widely utilized to preserve training data privacy in deep learning.
DPSGD clips the gradients to a norm and then injects calibrated noise into the training procedure.
We propose a novel approach, Discriminative Clipping (DC)-DPSGD, with two key iterations.
 arXiv  Detail & Related papers  (2024-05-27T16:30:11Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
 We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
 arXiv  Detail & Related papers  (2024-05-17T21:17:27Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
 We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
 arXiv  Detail & Related papers  (2024-03-11T09:10:37Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
 We show that when done right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
 arXiv  Detail & Related papers  (2023-10-31T16:15:13Z)
- Towards More Robust Interpretation via Local Gradient Alignment [37.464250451280336]
 We show that for every non-negative homogeneous neural network, a naive $\ell_2$-robust criterion for gradients is not normalization invariant.
We propose to combine both $\ell_2$ and cosine distance-based criteria as regularization terms to leverage the advantages of both in aligning the local gradient.
We experimentally show that models trained with our method produce much more robust interpretations on CIFAR-10 and ImageNet-100.
 arXiv  Detail & Related papers  (2022-11-29T03:38:28Z)
- Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
 Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning, which uses historical gradients to compress gradients.
We also implement our gradient quantization method on the real dataset, and the performance of our method is better than the previous schemes.
 arXiv  Detail & Related papers  (2021-11-16T07:55:43Z)
- Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
 Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
 arXiv  Detail & Related papers  (2020-10-05T14:36:59Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
 We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
 arXiv  Detail & Related papers  (2020-08-25T00:44:09Z)
- Understanding Gradient Clipping in Private SGD: A Geometric Perspective [68.61254575987013]
 Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information.
Many learning systems now incorporate differential privacy by training their models with (differentially) private SGD.
A key step in each private SGD update is gradient clipping that shrinks the gradient of an individual example whenever its L2 norm exceeds some threshold.
 arXiv  Detail & Related papers  (2020-06-27T19:08:12Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
 We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
 arXiv  Detail & Related papers  (2020-06-16T13:41:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     