Momentum Provably Improves Error Feedback!
- URL: http://arxiv.org/abs/2305.15155v2
- Date: Mon, 30 Oct 2023 16:18:44 GMT
- Title: Momentum Provably Improves Error Feedback!
- Authors: Ilyas Fatkhullin, Alexander Tyurin, Peter Richtárik
- Abstract summary: When untreated, errors caused by compression propagate and can lead to severely unstable, even exponentially divergent, training behavior.
EF21-SGDM improves the communication and sample complexities of previous error feedback algorithms.
- Score: 54.93799845077906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the high communication overhead when training machine learning models
in a distributed environment, modern algorithms invariably rely on lossy
communication compression. However, when untreated, the errors caused by
compression propagate, and can lead to severely unstable behavior, including
exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an
error feedback (EF) mechanism, which we refer to as EF14, as an immensely
effective heuristic for mitigating this issue. However, despite steady
algorithmic and theoretical advances in the EF field in the last decade, our
understanding is far from complete. In this work we address one of the most
pressing issues. In particular, in the canonical nonconvex setting, all known
variants of EF rely on very large batch sizes to converge, which can be
prohibitive in practice. We propose a surprisingly simple fix which removes
this issue both theoretically, and in practice: the application of Polyak's
momentum to the latest incarnation of EF due to Richtárik et al. [2021]
known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves
the communication and sample complexities of previous error feedback algorithms
under standard smoothness and bounded variance assumptions, and does not
require any further strong assumptions such as bounded gradient dissimilarity.
Moreover, we propose a double momentum version of our method that improves the
complexities even further. Our proof seems to be novel even when compression is
removed from the method, and as such, our proof technique is of independent
interest in the study of nonconvex stochastic optimization enriched with
Polyak's momentum.
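To make the mechanism concrete, here is a minimal, illustrative sketch of the EF21-SGDM idea described above: each worker maintains a Polyak momentum estimate of its stochastic gradient and, EF21-style, communicates only a compressed correction to the gradient estimate it previously transmitted. The Top-K compressor, the toy quadratic losses, the update ordering and initialization, and all hyperparameter values are assumptions made for illustration, not the paper's exact algorithm or experimental setup.

```python
import numpy as np


def top_k(v, k):
    """Illustrative contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out


def ef21_sgdm(grad_fns, dim, steps=5000, lr=0.05, eta=0.05, k=1, seed=0):
    """Hedged sketch of EF21-SGDM with n workers (grad_fns[i](x, rng) = stochastic grad of f_i)."""
    rng = np.random.default_rng(seed)
    n = len(grad_fns)
    x = np.zeros(dim)
    v = [np.zeros(dim) for _ in range(n)]   # Polyak momentum estimators, one per worker
    g = [np.zeros(dim) for _ in range(n)]   # EF21 gradient estimators, one per worker
    g_avg = np.zeros(dim)                   # server-side average of the g_i
    for _ in range(steps):
        x = x - lr * g_avg                  # server step using the compressed-state average
        for i in range(n):
            grad = grad_fns[i](x, rng)
            v[i] = (1 - eta) * v[i] + eta * grad   # momentum smooths the stochastic gradient
            delta = top_k(v[i] - g[i], k)          # transmit only a compressed correction
            g[i] = g[i] + delta                    # worker-side EF21 state update
            g_avg = g_avg + delta / n              # server applies the same correction
    return x


# Toy usage (assumed setup): two workers with noisy quadratics f_i(x) = 0.5 * ||x - b_i||^2.
if __name__ == "__main__":
    b = [np.array([1.0, -2.0, 3.0]), np.array([0.0, 4.0, -1.0])]
    grad_fns = [lambda x, rng, bi=bi: (x - bi) + 0.1 * rng.standard_normal(x.shape) for bi in b]
    print(ef21_sgdm(grad_fns, dim=3, k=2))   # should approach the average of the b_i
```

The role of the momentum parameter eta is to smooth the stochastic gradients before they enter the EF21 compression recursion; this smoothing is the effect the paper credits with removing the need for large batch sizes.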
Related papers
- Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness
Constants [4.2177789094825515]
We study a modern form of error feedback called EF21 (Richtarik et al., 2021).
In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean.
We then present a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method.
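For reference, the two quantities contrasted here can be written as follows, where $L_1,\dots,L_n$ are the per-client smoothness constants; by the power-mean inequality the arithmetic mean is never larger than the quadratic mean, so a rate depending on it is never worse.

```latex
L_{\mathrm{QM}} \;=\; \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} L_i^{2}},
\qquad
L_{\mathrm{AM}} \;=\; \tfrac{1}{n}\sum_{i=1}^{n} L_i,
\qquad
L_{\mathrm{AM}} \;\le\; L_{\mathrm{QM}}.
```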
arXiv Detail & Related papers (2024-02-16T15:55:59Z) - Communication Compression for Byzantine Robust Learning: New Efficient
Algorithms and Improved Rates [9.965047642855074]
Byzantine robustness is an essential feature of algorithms for certain distributed optimization problems.
We propose a new Byzantine-robust compression method with a better convergence rate.
We also develop the first Byzantine-robust method that combines communication compression with error feedback.
arXiv Detail & Related papers (2023-10-15T11:22:34Z) - Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning [47.904127007515925]
We study a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction.
We prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic approximation guarantees as their uncompressed counterparts.
Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling.
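As a purely illustrative picture of "compressed TD with error feedback", below is a minimal sketch of TD(0) with linear function approximation in which the update direction is sparsified before being applied, and the discarded residual is kept in an error buffer and re-added at the next step. The Top-K sparsifier, the env_step/features interface, the initial state, and the hyperparameters are assumptions for illustration, not the exact construction analyzed in the paper.

```python
import numpy as np


def top_k(v, k):
    """Illustrative sparsifier: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out


def compressed_td0_with_ef(env_step, features, dim, alpha=0.05, gamma=0.95,
                           k=2, steps=10000, init_state=0, seed=0):
    """Hedged sketch: TD(0) with linear value function, compressed updates, error feedback.

    env_step(s, rng) -> (reward, next_state); features(s) -> length-dim feature vector.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)   # linear value-function parameters
    err = np.zeros(dim)     # error-feedback buffer holding the compression residual
    s = init_state
    for _ in range(steps):
        r, s_next = env_step(s, rng)
        phi, phi_next = features(s), features(s_next)
        delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
        g = alpha * delta * phi                              # uncompressed TD update direction
        c = top_k(g + err, k)                                # compress the update plus carried error
        err = g + err - c                                    # remember what the compressor dropped
        theta = theta + c                                    # apply only the compressed part
        s = s_next
    return theta
```

The only change relative to plain TD(0) is the pair (c, err): without the error buffer, the bias introduced by the sparsifier could accumulate across iterations.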
arXiv Detail & Related papers (2023-01-03T04:09:38Z) - Adversarial Robustness with Semi-Infinite Constrained Learning [177.42714838799924]
The vulnerability of deep learning to input perturbations has raised serious questions about its use in safety-critical domains.
We propose a hybrid Langevin Monte Carlo training approach to mitigate this issue.
We show that our approach can mitigate the trade-off between state-of-the-art performance and robustness.
arXiv Detail & Related papers (2021-10-29T13:30:42Z) - EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern
Error Feedback [11.899559337707112]
Existing theory of error feedback (EF) relies on very strong assumptions and provides pessimistic convergence rates.
Richtarik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a new compressor induced by a contractive compressor.
In this paper we propose six practical extensions of EF21, all supported by strong convergence theory.
arXiv Detail & Related papers (2021-10-07T09:29:14Z) - EF21: A New, Simpler, Theoretically Better, and Practically Faster Error
Feedback [0.0]
Error feedback (EF) is an immensely popular stabilization mechanism in the context of distributed training of supervised machine learning.
We propose and analyze a new EF mechanism, called EF21, which consistently and substantially outperforms the original EF mechanism in practice.
In particular, we prove that EF21 enjoys a fast $O(1/T)$ convergence rate for smooth nonconvex problems.
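For reference, the single-node form of the EF21 recursion (multi-node versions maintain one estimator $g_i^t$ per client and average them) reads as follows, where $\mathcal{C}$ is a contractive compressor, i.e. $\mathbb{E}\|\mathcal{C}(x)-x\|^2 \le (1-\alpha)\|x\|^2$ for some $\alpha \in (0,1]$:

```latex
x^{t+1} = x^{t} - \gamma\, g^{t},
\qquad
g^{t+1} = g^{t} + \mathcal{C}\!\left(\nabla f(x^{t+1}) - g^{t}\right).
```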
arXiv Detail & Related papers (2021-06-09T16:45:53Z) - Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z) - Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which enjoys better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z) - Targeted free energy estimation via learned mappings [66.20146549150475]
Free energy perturbation (FEP) was proposed by Zwanzig more than six decades ago as a method to estimate free energy differences.
FEP suffers from a severe limitation: the requirement of sufficient overlap between distributions.
One strategy to mitigate this problem, called Targeted Free Energy Perturbation, uses a high-dimensional mapping in configuration space to increase overlap.
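For context, Zwanzig's identity estimates the free-energy difference between states $A$ and $B$ from samples of $A$ alone; the overlap problem arises because the exponential average below is dominated by configurations that are rare under $A$ when the two distributions barely overlap ($\beta = 1/(k_B T)$, with $U_A$ and $U_B$ the potential energies):

```latex
\Delta F \;=\; F_B - F_A \;=\; -\beta^{-1}\,
\ln \left\langle e^{-\beta\,\left(U_B - U_A\right)} \right\rangle_{A}.
```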
arXiv Detail & Related papers (2020-02-12T11:10:00Z)