Step-Ahead Error Feedback for Distributed Training with Compressed
Gradient
- URL: http://arxiv.org/abs/2008.05823v3
- Date: Mon, 24 Jan 2022 15:49:53 GMT
- Title: Step-Ahead Error Feedback for Distributed Training with Compressed
Gradient
- Authors: An Xu, Zhouyuan Huo, Heng Huang
- Abstract summary: We show that a new "gradient mismatch" problem is raised by the local error feedback in centralized distributed training.
We propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis.
- Score: 99.42912552638168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although distributed machine learning methods can speed up the training
of large deep neural networks, communication cost has become a non-negligible
bottleneck that constrains performance. To address this challenge,
gradient-compression-based communication-efficient distributed learning methods
were designed to reduce the communication cost, and more recently local error
feedback was incorporated to compensate for the resulting performance loss.
In this paper, however, we show that local error feedback raises a new
"gradient mismatch" problem in centralized distributed training, which can lead
to degraded performance compared with full-precision training. To solve this
critical problem, we propose two novel techniques, 1) step ahead and 2) error
averaging, with rigorous theoretical analysis. Both our theoretical and
empirical results show that the new methods handle the "gradient mismatch"
problem. The experimental results further show that, with common gradient
compression schemes, we can even train faster (in training epochs) than both
full-precision training and local error feedback, without performance loss.
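To make the baseline concrete, below is a minimal sketch of centralized distributed SGD with top-k gradient compression and local error feedback, i.e. the setting whose "gradient mismatch" problem the paper analyzes. The toy quadratic objective, worker count, learning rate, and top-k ratio are illustrative assumptions, and the sketch does not implement the paper's step-ahead or error-averaging techniques.

```python
import numpy as np

def top_k(v, ratio=0.1):
    """Keep the largest-magnitude entries of v and zero out the rest."""
    k = max(1, int(ratio * v.size))
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
dim, n_workers, lr, steps = 50, 4, 0.05, 200

# Each worker i holds a toy quadratic loss 0.5 * ||x - t_i||^2 (an assumption for illustration).
targets = [rng.normal(size=dim) for _ in range(n_workers)]
x = np.zeros(dim)                                    # shared model
errors = [np.zeros(dim) for _ in range(n_workers)]   # local error-feedback memories

for step in range(steps):
    updates = []
    for i in range(n_workers):
        g = x - targets[i]          # local gradient of worker i
        p = g + errors[i]           # compensate with the locally stored residual
        c = top_k(p)                # only the compressed vector is communicated
        errors[i] = p - c           # the dropped part stays local for the next round
        updates.append(c)
    x -= lr * np.mean(updates, axis=0)   # server averages the compressed gradients

print("average worker loss:", np.mean([0.5 * np.sum((x - t) ** 2) for t in targets]))
```

Because each worker re-injects its own stale residual into the next compensated gradient, the averaged update can drift away from the gradient of the current global model; informally, this is the kind of mismatch that the proposed step-ahead and error-averaging corrections are meant to remove.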
Related papers
- Simplicity bias and optimization threshold in two-layer ReLU networks [24.43739371803548]
We show that despite overparametrization, networks converge toward simpler solutions rather than interpolating the training data.
Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions.
arXiv Detail & Related papers (2024-10-03T09:58:57Z) - Mask-Encoded Sparsification: Mitigating Biased Gradients in Communication-Efficient Split Learning [15.78336840511033]
This paper introduces a novel framework designed to achieve a high compression ratio in Split Learning (SL) scenarios.
Our investigations demonstrate that compressing feature maps within SL leads to biased gradients that can negatively impact the convergence rates.
We employ a narrow bit-width encoded mask to compensate for the sparsification error without increasing the order of time complexity.
arXiv Detail & Related papers (2024-08-25T09:30:34Z) - Communication-Efficient Distributed Learning with Local Immediate Error
Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z) - Balance is Essence: Accelerating Sparse Training via Adaptive Gradient
Correction [29.61757744974324]
Deep neural networks incur significant memory and computation costs.
Sparse training is one of the most common techniques to reduce these costs.
In this work, we aim to overcome this problem and achieve space-time co-efficiency.
arXiv Detail & Related papers (2023-01-09T18:50:03Z) - Adversarial Coreset Selection for Efficient Robust Training [11.510009152620666]
We show how selecting a small subset of training data provides a principled approach to reducing the time complexity of robust training.
We conduct extensive experiments to demonstrate that our approach speeds up adversarial training by 2-3 times.
arXiv Detail & Related papers (2022-09-13T07:37:53Z) - Distribution Mismatch Correction for Improved Robustness in Deep Neural
Networks [86.42889611784855]
Normalization methods increase vulnerability to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z) - Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization goal helps the network achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
arXiv Detail & Related papers (2021-09-12T23:14:06Z) - Balanced Softmax Cross-Entropy for Incremental Learning [6.5423218639215275]
Deep neural networks are prone to catastrophic forgetting when incrementally trained on new classes or new tasks.
Recent methods have proven effective at mitigating catastrophic forgetting.
We propose the use of the Balanced Softmax Cross-Entropy loss and show that it can be combined with existing methods for incremental learning to improve their performance.
arXiv Detail & Related papers (2021-03-23T13:30:26Z) - Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
The Prototype-centered Attentive Learning (PAL) model is composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z) - How Important is the Train-Validation Split in Meta-Learning? [155.5088631672781]
A common practice in meta-learning is to perform a train-validation split (the train-val method) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.
Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice.
We show that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
arXiv Detail & Related papers (2020-10-12T16:48:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.