EF21: A New, Simpler, Theoretically Better, and Practically Faster Error
Feedback
- URL: http://arxiv.org/abs/2106.05203v1
- Date: Wed, 9 Jun 2021 16:45:53 GMT
- Title: EF21: A New, Simpler, Theoretically Better, and Practically Faster Error
Feedback
- Authors: Peter Richtárik and Igor Sokolov and Ilyas Fatkhullin
- Abstract summary: Error feedback (EF) is an immensely popular stabilization mechanism in the context of distributed training of supervised machine learning.
We propose and analyze a new EF mechanism, which we call EF21, which consistently and substantially outperforms EF in practice.
In particular, we prove that EF21 enjoys a fast $O(1/T)$ convergence rate for smooth nonconvex problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Error feedback (EF), also known as error compensation, is an immensely
popular convergence stabilization mechanism in the context of distributed
training of supervised machine learning models enhanced by the use of
contractive communication compression mechanisms, such as Top-$k$. First
proposed by Seide et al. (2014) as a heuristic, EF resisted any theoretical
understanding until recently [Stich et al., 2018, Alistarh et al., 2018].
However, all existing analyses either i) apply to the single node setting only,
ii) rely on very strong and often unreasonable assumptions, such as global
boundedness of the gradients, or iterate-dependent assumptions that cannot be
checked a-priori and may not hold in practice, or iii) circumvent these issues
via the introduction of additional unbiased compressors, which increase the
communication cost. In this work we fix all these deficiencies by proposing and
analyzing a new EF mechanism, which we call EF21, which consistently and
substantially outperforms EF in practice. Our theoretical analysis relies on
standard assumptions only, works in the distributed heterogeneous data setting,
and leads to better and more meaningful rates. In particular, we prove that
EF21 enjoys a fast $O(1/T)$ convergence rate for smooth nonconvex problems,
beating the previous bound of $O(1/T^{2/3})$, which was shown under a bounded
gradients assumption. We further improve this to a fast linear rate for PL
functions, which is the first linear convergence result for an EF-type method
not relying on unbiased compressors. Since EF has a large number of
applications where it reigns supreme, we believe that our 2021 variant, EF21,
can have a large impact on the practice of communication-efficient distributed
learning.
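To make the mechanism concrete, below is a minimal sketch of the EF21 update rule paired with a Top-$k$ compressor on a toy distributed least-squares problem. The objective, dimensions, step size, and iteration count are illustrative assumptions, not settings from the paper; recall that a contractive compressor $\mathcal{C}$ such as Top-$k$ satisfies $\|\mathcal{C}(x)-x\|^2 \le (1-\alpha)\|x\|^2$ with $\alpha = k/d$.

    # Minimal sketch (not from the paper) of the EF21 update with a Top-k compressor
    # on a toy distributed least-squares problem; all sizes and the step size are illustrative.
    import numpy as np

    def top_k(v, k):
        # Contractive Top-k compressor: keep the k largest-magnitude entries, zero the rest.
        out = np.zeros_like(v)
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
        return out

    rng = np.random.default_rng(0)
    n, d, k, gamma, T = 10, 50, 5, 1e-3, 2000   # nodes, dimension, Top-k, step size, iterations

    # Heterogeneous local objectives: f_i(x) = 0.5 * ||A_i x - b_i||^2
    A = [rng.standard_normal((20, d)) for _ in range(n)]
    b = [rng.standard_normal(20) for _ in range(n)]
    grad = lambda i, x: A[i].T @ (A[i] @ x - b[i])

    x = np.zeros(d)
    g_local = [top_k(grad(i, x), k) for i in range(n)]  # g_i^0 = C(grad f_i(x^0))
    g = np.mean(g_local, axis=0)                        # server estimate g^0

    for t in range(T):
        x = x - gamma * g                               # x^{t+1} = x^t - gamma * g^t
        deltas = []
        for i in range(n):
            c_i = top_k(grad(i, x) - g_local[i], k)     # compress the *change*, not the raw gradient
            g_local[i] = g_local[i] + c_i               # g_i^{t+1} = g_i^t + c_i
            deltas.append(c_i)                          # only c_i is communicated to the server
        g = g + np.mean(deltas, axis=0)                 # g^{t+1} = g^t + (1/n) * sum_i c_i

    full_grad = np.mean([grad(i, x) for i in range(n)], axis=0)
    print("||grad f(x_T)|| =", np.linalg.norm(full_grad))

The key design choice, per the abstract, is that each node compresses the difference between its fresh local gradient and its running estimate $g_i^t$, rather than an accumulated error term as in classical EF, so the estimate tracks $\nabla f_i(x^t)$ increasingly well without any bounded-gradients assumption.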
Related papers
- Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized federated learning (DFL) eliminates reliance on the client-server architecture.
Non-smooth regularization is often incorporated into machine learning tasks.
We propose a novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.
We show that the widely used beam search method suffers from unacceptable over-optimism.
We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - MARINA-P: Superior Performance in Non-smooth Federated Optimization with Adaptive Stepsizes [57.24311218570012]
We extend the theory of EF21-P (Anonymous 2024) and MARINA-P (arXiv:2402.06412) to the non-smooth convex setting.
We provide theoretical guarantees under constant, decreasing, and adaptive (Polyak-type) stepsizes.
arXiv Detail & Related papers (2024-12-22T16:18:34Z) - A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging the iterates of Stochastic Gradient Descent (SGD), as in Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and Latest Weight Averaging (LAWA), has achieved empirical success in training deep learning models.
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization.
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness
Constants [4.2177789094825515]
We study a modern form of error feedback called EF21 (Richtarik et al., 2021).
In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean.
We then discover a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method.
arXiv Detail & Related papers (2024-02-16T15:55:59Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in
Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Momentum Provably Improves Error Feedback! [54.93799845077906]
When left untreated, errors caused by compression propagate and can lead to exponentially divergent training behavior.
EF21-SGDM improves the communication and sample complexities of previous error feedback algorithms.
arXiv Detail & Related papers (2023-05-24T13:52:02Z) - Analysis of Error Feedback in Federated Non-Convex Optimization with
Biased Compression [37.6593006747285]
In federated learning (FL) systems, the communication cost between the clients and the central server can be a bottleneck.
In this paper, we propose a technique to remedy the downsides of biased compression.
Under partial participation, our analysis reveals an extra slow-down factor due to a so-called "stale error accumulation" effect.
arXiv Detail & Related papers (2022-11-25T18:49:53Z) - Optimizing Two-way Partial AUC with an End-to-end Framework [154.47590401735323]
Area Under the ROC Curve (AUC) is a crucial metric for machine learning.
Recent work shows that the TPAUC is essentially inconsistent with the existing Partial AUC metrics.
We present the first trial in this paper to optimize this new metric.
arXiv Detail & Related papers (2022-06-23T12:21:30Z) - Permutation Compressors for Provably Faster Distributed Nonconvex
Optimization [68.8204255655161]
We show that the MARINA method of Gorbunov et al. (2021) can be considered a state-of-the-art method in terms of theoretical communication complexity.
We extend the theory of MARINA to support potentially correlated compressors, taking the method beyond the classical independent compressors setting.
arXiv Detail & Related papers (2021-10-07T09:38:15Z) - EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern
Error Feedback [11.899559337707112]
Existing theory of error feedback (EF) relies on very strong assumptions and provides pessimistic convergence rates.
Richtarik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a compressor induced by a contractive approximation.
In this paper we propose six practical extensions of EF21, all supported by strong convergence theory.
arXiv Detail & Related papers (2021-10-07T09:29:14Z) - A Better Alternative to Error Feedback for Communication-Efficient
Distributed Learning [0.0]
We show that our approach leads to vast improvements over EF, including reduced memory requirements, better complexity guarantees and fewer assumptions.
We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits.
arXiv Detail & Related papers (2020-06-19T11:24:41Z) - Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new distributed error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)